maartengr / bertopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

License: MIT License

Python 99.93% Makefile 0.07%
bert transformers topic-modeling sentence-embeddings nlp machine-learning topic ldavis topic-modelling topic-models

bertopic's Introduction


BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports all kinds of topic modeling techniques:

Guided, Supervised, Semi-supervised, Manual, Multi-topic distributions, Hierarchical, Class-based, Dynamic, Online/Incremental, Multimodal, Multi-aspect, Text Generation/LLM, Zero-shot (new!), Merge Models (new!), and Seed Words (new!)

Corresponding medium posts can be found here, here and here. For a more detailed overview, you can read the paper or see a brief overview.

Installation

Installation, with sentence-transformers, can be done using PyPI:

pip install bertopic

If you want to install BERTopic with other embedding models, you can choose one of the following:

# Choose an embedding backend
pip install bertopic[flair,gensim,spacy,use]

# Topic modeling with images
pip install bertopic[vision]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation or you can follow along with one of the examples below:

Name Link
Start Here - Best Practices in BERTopic Open In Colab
🆕 New! - Topic Modeling on Large Data (GPU Acceleration) Open In Colab
🆕 New! - Topic Modeling with Llama 2 🦙 Open In Colab
🆕 New! - Topic Modeling with Quantized LLMs Open In Colab
Topic Modeling with BERTopic Open In Colab
(Custom) Embedding Models in BERTopic Open In Colab
Advanced Customization in BERTopic Open In Colab
(semi-)Supervised Topic Modeling with BERTopic Open In Colab
Dynamic Topic Modeling with Trump's Tweets Open In Colab
Topic Modeling arXiv Abstracts Kaggle

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access all of the topics together with their topic representations:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
49	693	49_windows_drive_dos_file
32	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
22	381	22_key_encryption_keys_encrypted
...

The -1 topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used to interpret it. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

Using .get_document_info, we can also extract information on a document level, such as each document's assigned topic, its probability, whether it is a representative document for its topic, and so on:

>>> topic_model.get_document_info(docs)

Document                               Topic	Name	                        Top_n_words                     Probability    ...
I am sure some bashers of Pens...	0	0_game_team_games_season	game - team - games...	        0.200010       ...
My brother is in the market for...      -1     -1_can_your_will_any	        can - your - will...	        0.420668       ...
Finally you said what you dream...	-1     -1_can_your_will_any	        can - your - will...            0.807259       ...
Think! It's the SCSI card doing...	49     49_windows_drive_dos_file	windows - drive - dos...	0.071746       ...
1) I have an old Jasmine drive...	49     49_windows_drive_dos_file	windows - drive - dos...	0.038983       ...

🔥 Tip: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Fine-tune Topic Representations

In BERTopic, there are a number of different topic representations that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is KeyBERTInspired, which for many users increases the coherence and reduces stopwords from the resulting topic representations:

from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)

However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:

import openai
from bertopic.representation import OpenAI

# Fine-tune topic representations with GPT
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
topic_model = BERTopic(representation_model=representation_model)

🔥 Tip: Instead of iterating over all of these different topic representations, you can model them simultaneously with multi-aspect topic representations in BERTopic.
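
As a minimal sketch of that idea (assuming the dictionary-based multi-aspect API described in the documentation), several representations can be computed at once by passing a dictionary of representation models:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Each key becomes one aspect of the topic representation
representation_models = {
    "KeyBERT": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
}
topic_model = BERTopic(representation_model=representation_models)
# After fitting, the extra aspects can be inspected through the .topic_aspects_ attribute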

Visualizations

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the many visualization options in BERTopic. For example, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

Modularity

By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, BERTopic assumes some independence between these steps, which makes it quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:

[Video: BERTopicOverview.mp4]

You can swap out any of these models or even remove them entirely. The following steps are completely modular (a configuration sketch follows the list):

  1. Embedding documents
  2. Reducing dimensionality of embeddings
  3. Clustering reduced embeddings into topics
  4. Tokenizing topics
  5. Weighting tokens
  6. Representing topics with one or multiple representations
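
A rough configuration sketch of these steps (the specific sub-models below are examples, not recommendations; any alternative with the same interface could be used instead):

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")            # 1. Embedding documents
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")   # 2. Reducing dimensionality
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)   # 3. Clustering embeddings
vectorizer_model = CountVectorizer(stop_words="english")             # 4. Tokenizing topics
ctfidf_model = ClassTfidfTransformer()                               # 5. Weighting tokens

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)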

Functionality

BERTopic has many functions that can quickly become overwhelming. To alleviate this issue, below you will find an overview of all methods with a short description of their purpose.

Common

Below, you will find an overview of common functions in BERTopic, followed by a short usage sketch.

Method Code
Fit the model .fit(docs)
Fit the model and predict documents .fit_transform(docs)
Predict new documents .transform([new_doc])
Access single topic .get_topic(topic=12)
Access all topics .get_topics()
Get topic freq .get_topic_freq()
Get all topic information .get_topic_info()
Get all document information .get_document_info(docs)
Get representative docs per topic .get_representative_docs()
Update topic representation .update_topics(docs, n_gram_range=(1, 3))
Generate topic labels .generate_topic_labels()
Set topic labels .set_topic_labels(my_custom_labels)
Merge topics .merge_topics(docs, topics_to_merge)
Reduce nr of topics .reduce_topics(docs, nr_topics=30)
Reduce outliers .reduce_outliers(docs, topics)
Find topics .find_topics("vehicle")
Save model .save("my_model", serialization="safetensors")
Load model BERTopic.load("my_model")
Get parameters .get_params()
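
A short usage sketch tying a few of these methods together (docs is assumed to be the document list from the Quick Start above):

from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)        # fit the model and predict documents

topic_model.get_topic_info()                           # overview of all topics
topic_model.get_topic(topic=0)                         # top words of a single topic
topic_model.find_topics("vehicle")                     # search topics by a term

topic_model.save("my_model", serialization="safetensors")
loaded_model = BERTopic.load("my_model")
new_topics, new_probs = loaded_model.transform(["A new document about space travel."])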

Attributes

After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part, refer to how model information is stored on an estimator during fitting. The attributes below all end in _ and are public attributes that can be used to access model information; a brief example of reading them follows the table.

Attribute Description
.topics_ The topics that are generated for each document after training or updating the topic model.
.probabilities_ The probabilities that are generated for each document if HDBSCAN is used.
.topic_sizes_ The size of each topic
.topic_mapper_ A class for tracking topics and their mappings anytime they are merged/reduced.
.topic_representations_ The top n terms per topic and their respective c-TF-IDF values.
.c_tf_idf_ The topic-term matrix as calculated through c-TF-IDF.
.topic_aspects_ The different aspects, or representations, of each topic.
.topic_labels_ The default labels for each topic.
.custom_labels_ Custom labels for each topic as generated through .set_topic_labels.
.topic_embeddings_ The embeddings for each topic if embedding_model was used.
.representative_docs_ The representative documents for each topic if HDBSCAN is used.
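
For illustration, after fitting a model as in the Quick Start, these attributes can be read directly (a sketch; the exact contents depend on your data):

topic_model.topics_                         # one topic id per input document
topic_model.topic_sizes_                    # mapping from topic id to number of documents
topic_model.topic_representations_[0][:5]   # top 5 (word, c-TF-IDF score) pairs for topic 0
topic_model.c_tf_idf_.shape                 # the sparse topic-term matrix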

Variations

There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed so that one package can be used across many use cases; an example of one such variation follows the table below.

Method Code
Topic Distribution Approximation .approximate_distribution(docs)
Online Topic Modeling .partial_fit(doc)
Semi-supervised Topic Modeling .fit(docs, y=y)
Supervised Topic Modeling .fit(docs, y=y)
Manual Topic Modeling .fit(docs, y=y)
Multimodal Topic Modeling .fit(docs, images=images)
Topic Modeling per Class .topics_per_class(docs, classes)
Dynamic Topic Modeling .topics_over_time(docs, timestamps)
Hierarchical Topic Modeling .hierarchical_topics(docs)
Guided Topic Modeling BERTopic(seed_topic_list=seed_topic_list)
Zero-shot Topic Modeling BERTopic(zeroshot_topic_list=zeroshot_topic_list)
Merge Multiple Models BERTopic.merge_models([topic_model_1, topic_model_2])
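
As one example from this table, dynamic topic modeling might look as follows (tweets and timestamps are hypothetical lists of equal length; the call assumes the .topics_over_time(docs, timestamps) signature listed above):

from bertopic import BERTopic

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

# Model how each topic evolves over time and plot the result
topics_over_time = topic_model.topics_over_time(tweets, timestamps)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)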

Visualizations

Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. Visualizing different aspects of the topic model helps in understanding the model and makes it easier to tweak the model to your liking.

Method Code
Visualize Topics .visualize_topics()
Visualize Documents .visualize_documents()
Visualize Document Hierarchy .visualize_hierarchical_documents()
Visualize Topic Hierarchy .visualize_hierarchy()
Visualize Topic Tree .get_topic_tree(hierarchical_topics)
Visualize Topic Terms .visualize_barchart()
Visualize Topic Similarity .visualize_heatmap()
Visualize Term Score Decline .visualize_term_rank()
Visualize Topic Probability Distribution .visualize_distribution(probs[0])
Visualize Topics over Time .visualize_topics_over_time(topics_over_time)
Visualize Topics per Class .visualize_topics_per_class(topics_per_class)

Citation

To cite the BERTopic paper, please use the following BibTeX reference:

@article{grootendorst2022bertopic,
  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
  author={Grootendorst, Maarten},
  journal={arXiv preprint arXiv:2203.05794},
  year={2022}
}

bertopic's People

Contributors

agamble, ananaphasia, anubhabdaserrr, aratako, atmb4u, bobchien, chrisji, davanstrien, dkapitan, domenicrosati, dschwalm, dwhdai, elashrry, felsiq, jiaxin-wen, joouha, joshuasundance-swca, lawrencefulton, leloykun, liaoelton, lmcinnes, luisoala, maartengr, mertyyanik, nicholsonjf, nreimers, peguerosdc, proselotis, snape, zilch42


bertopic's Issues

Problem with numba package

I keep having the same installation error, related to the numba package. See error below:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-160-1d2f5f7c9d67> in <module>
----> 1 from bertopic import BERTopic

/opt/anaconda3/lib/python3.7/site-packages/bertopic/__init__.py in <module>
----> 1 from bertopic._bertopic import BERTopic
      2 from bertopic._ctfidf import ClassTFIDF
      3 from bertopic._embeddings import languages
      4 
      5 __version__ = "0.4.3"

/opt/anaconda3/lib/python3.7/site-packages/bertopic/_bertopic.py in <module>
     10 
     11 # Models
---> 12 import umap
     13 import hdbscan
     14 from sentence_transformers import SentenceTransformer

/opt/anaconda3/lib/python3.7/site-packages/umap/__init__.py in <module>
      1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
      3 
      4 try:
      5     with catch_warnings():

/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py in <module>
     45 )
     46 
---> 47 from pynndescent import NNDescent
     48 from pynndescent.distances import named_distances as pynn_named_distances
     49 from pynndescent.sparse import sparse_named_distances as pynn_sparse_named_distances

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/__init__.py in <module>
      1 import pkg_resources
      2 import numba
----> 3 from .pynndescent_ import NNDescent, PyNNDescentTransformer
      4 
      5 # Workaround: https://github.com/numba/numba/issues/3341

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in <module>
     19 import heapq
     20 
---> 21 import pynndescent.sparse as sparse
     22 import pynndescent.sparse_nndescent as sparse_nnd
     23 import pynndescent.distances as pynnd_dist

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/sparse.py in <module>
      8 import numba
      9 
---> 10 from pynndescent.utils import norm, tau_rand
     11 from pynndescent.distances import kantorovich
     12 

/opt/anaconda3/lib/python3.7/site-packages/pynndescent/utils.py in <module>
      6 
      7 import numba
----> 8 from numba.core import types
      9 from numba.experimental import structref
     10 import numpy as np

ModuleNotFoundError: No module named 'numba.core'

I am running on macOS Big Sur. Package versions:
bertopic==0.4.3
conda==4.9.2
numba==0.52.0
umap-learn==0.5.0
Python==3.7.6

I've already done a lot of searching on the internet but can't find any solution. Does somebody have the same problem or any idea how to solve this?

Thanks in advance!

TypingError : Failed in nopython mode pipeline (step: nopython frontend)

I'm trying to load a trained BERTopic model from disk by using BERTopic.load, but I'm getting this error:

TypingError                               Traceback (most recent call last)
<ipython-input-9-2081de8232b3> in <module>
      1 import joblib
      2 with open('bertopic_model', 'rb') as file:
----> 3     model=joblib.load(file)

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
    573         filename = getattr(fobj, 'name', '')
    574         with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 575             obj = _unpickle(fobj)
    576     else:
    577         with open(filename, 'rb') as f:

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
    502     obj = None
    503     try:
--> 504         obj = unpickler.load()
    505         if unpickler.compat_mode:
    506             warnings.warn("The file '%s' has been generated with a "

/usr/lib/python3.8/pickle.py in load(self)
   1208                     raise EOFError
   1209                 assert isinstance(key, bytes_types)
-> 1210                 dispatch[key[0]](self)
   1211         except _Stop as stopinst:
   1212             return stopinst.value

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/joblib/numpy_pickle.py in load_build(self)
    327         NDArrayWrapper is used for backward compatibility with joblib <= 0.9.
    328         """
--> 329         Unpickler.load_build(self)
    330 
    331         # For backward compatibility, we support NDArrayWrapper objects.

/usr/lib/python3.8/pickle.py in load_build(self)
   1701         setstate = getattr(inst, "__setstate__", None)
   1702         if setstate is not None:
-> 1703             setstate(state)
   1704             return
   1705         slotstate = None

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in __setstate__(self, d)
   1026     def __setstate__(self, d):
   1027         self.__dict__ = d
-> 1028         self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
   1029 
   1030     def _init_search_graph(self):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py in <listcomp>(.0)
   1026     def __setstate__(self, d):
   1027         self.__dict__ = d
-> 1028         self._rp_forest = tuple([renumbaify_tree(tree) for tree in d["_rp_forest"]])
   1029 
   1030     def _init_search_graph(self):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/pynndescent/rp_trees.py in renumbaify_tree(tree)
   1176     point_indices = numba.typed.List.empty_list(point_indices_type)
   1177 
-> 1178     hyperplanes.extend(tree.hyperplanes)
   1179     offsets.extend(tree.offsets)
   1180     children.extend(tree.children)

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py in extend(self, iterable)
    364             # can not be sliced.
    365             self._initialise_list(iterable[0])
--> 366         return _extend(self, iterable)
    367 
    368     def remove(self, item):

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    413                 e.patch_message(msg)
    414 
--> 415             error_rewrite(e, 'typing')
    416         except errors.UnsupportedError as e:
    417             # Something unsupported is present in the user code, add help info

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    356                 raise e
    357             else:
--> 358                 reraise(type(e), e, None)
    359 
    360         argtypes = []

/mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     78         value = tp()
     79     if value.__traceback__ is not tb:
---> 80         raise value.with_traceback(tb)
     81     raise value
     82 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_extend at 0x7f2a3dc6f4c0>) found for signature:

 >>> impl_extend(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'impl_extend': File: numba/typed/listobject.py: Line 1027.
    With argument(s): '(ListType[array(float64, 2d, C)], reflected list(array(float32, 1d, C))<iv=None>)':
   Rejected as the implementation raised a specific error:
     TypingError: Failed in nopython mode pipeline (step: nopython frontend)
   - Resolution failure for literal arguments:
   No implementation of function Function(<function impl_append at 0x7f2a3dcf4c10>) found for signature:

    >>> impl_append(ListType[array(float64, 2d, C)], array(float32, 1d, C))

   There are 2 candidate implementations:
         - Of which 2 did not match due to:
         Overload in function 'impl_append': File: numba/typed/listobject.py: Line 589.
           With argument(s): '(ListType[array(float64, 2d, C)], array(float32, 1d, C))':
          Rejected as the implementation raised a specific error:
            LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
          
          
          File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 597:
              def impl(l, item):
                  casteditem = _cast(item, itemty)
                  ^
          
          During: lowering "$8call_function.3 = call $2load_global.0(item, $6load_deref.2, func=$2load_global.0, args=[Var(item, listobject.py:597), Var($6load_deref.2, listobject.py:597)], kws=(), vararg=None)" at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (597)
     raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/utils.py:81
   
   - Resolution failure for non-literal arguments:
   None
   
   During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[array(float64, 2d, C)])
   During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/listobject.py (1051)
   
   
   File "../env/lib/python3.8/site-packages/numba/typed/listobject.py", line 1051:
               def impl(l, iterable):
                   <source elided>
                   for i in iterable:
                       l.append(i)
                       ^

  raised from /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/core/typeinfer.py:1071

- Resolution failure for non-literal arguments:
None

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'extend') for ListType[array(float64, 2d, C)])
During: typing of call at /mnt/d/maserati/git/ticket_analysis/env/lib/python3.8/site-packages/numba/typed/typedlist.py (101)


File "../env/lib/python3.8/site-packages/numba/typed/typedlist.py", line 101:
def _extend(l, iterable):
    return l.extend(iterable)
    ^

I tried to upgrade to joblib 1.0.0 but I'm still getting the same error. Did anyone receive the same error in the past?
Why not use pickle/dill instead of joblib==0.17.0?

llvmlite Error when install

Hi, I am trying to install BERTopic on mac and get the error:


    ----------------------------------------
ERROR: Command errored out with exit status 1: ./venv/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"'; __file__='"'"'/private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-install-0b7qoglk/llvmlite_1f4cd98020be43c1adad1fa52c6be7a7/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/39/88clnp910zlg54lrgy0d7qm40000gn/T/pip-record-mk9vn8xa/install-record.txt --single-version-externally-managed --compile --install-headers ./venv/include/site/python3.9/llvmlite Check the logs for full command output.

Any idea?

issue due to the _plotly_topic_visualization() method

There is an issue related to the _plotly_topic_visualization() method.
In Python 3.x, d.keys() returns a view object (not a list), so passing a dictionary to the hover_data parameter when creating the fig will cause an error.
here is the code:

# Plotting topics
fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
                 hover_data={"x": False, "y": False, "Topic": True, "Words": True, "Size": True})

To solve this problem, simply use a list():

# Plotting topics
fig = px.scatter(df, x="x", y="y", size="Size", size_max=40, template="simple_white", labels={"x": "", "y": ""},
                 hover_data=list({"x": False, "y": False, "Topic": True, "Words": True, "Size": True}))

ValueError: k must be less than or equal to the number of training points

Hi,

I want to use your pipeline with my own embeddings. However, I always get this error:

2020-12-03 15:04:21,143 - BERTopic - Reduced dimensionality with UMAP
2020-12-03 15:04:21 - Reduced dimensionality with UMAP

ValueError Traceback (most recent call last)
in ()
9 npcorpus_embeds = np.array(corpus_embeds)
10
---> 11 topics = bmodel.fit_transform(cats, npcorpus_embeds)

4 frames
/usr/local/lib/python3.6/dist-packages/hdbscan/prediction.py in init(self, data, condensed_tree, min_samples, tree_type, metric, **kwargs)
102 self.tree = self._tree_type_map[tree_type](self.raw_data,
103 metric=metric, **kwargs)
--> 104 self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
105 self.dist_metric = DistanceMetric.get_metric(metric, **kwargs)
106

sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._kd_tree.BinaryTree.query()

ValueError: k must be less than or equal to the number of training points

I also tried using the built in embedding creation but got the same error. Do you know, what the problem could be?

No random state

Hi,
When I run the model with the same data and the same parameters, I get different clusters. Is there a way to fix random state for reproducibility?

Thanks

Compute Sub-Clusters

Hi,

I used BERTopic on the arXiv dataset and extracted the most frequent topics (the biggest clusters).
Now I want to get the sub-clusters of the biggest cluster. What I did was simply filter the documents and umap_embeddings by the corresponding cluster label and re-run HDBSCAN and c-TF-IDF on the subsets.
However, the results are not really satisfying. Even though my most frequent topic has a cluster size of 6756, I only get two sub-clusters: one with size 5810 and one with 579. If I repeat the process on the 5810 sub-cluster to get the sub-sub-clusters, HDBSCAN fails to find any clusters and all documents get label -1.
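
Concretely, what I did looks roughly like this (a sketch with hypothetical variable names; umap_embeddings and labels come from the fitted model):

import numpy as np
import hdbscan

# Keep only the documents and reduced embeddings of the biggest cluster
biggest = 0                                   # id of the largest topic/cluster
mask = np.array(labels) == biggest
sub_embeddings = umap_embeddings[mask]
sub_docs = [doc for doc, keep in zip(docs, mask) if keep]

# Re-run HDBSCAN on the subset to find sub-clusters
sub_clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
sub_labels = sub_clusterer.fit_predict(sub_embeddings)
# c-TF-IDF is then re-computed per sub_label to describe each sub-cluster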

Is there something wrong about my approach? I feel like hdbscan should be able to find more clusters with cluster sizes of 6756 and 5810. For the first clustering I got 2733 clusters/topics.

The parameters are all on default.

Best
Karol

Languages bug

In the line

elif self.language.lower() in languages:

you check whether the language argument is in the languages list. But lowercasing is applied first while the languages list elements start with an uppercase letter, so it is not possible to initialize BERTopic with any language from this list besides English (which another if case handles).
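
A minimal sketch of the problem and of a possible fix (just an illustration, not the actual patch):

languages = ["English", "Dutch", "German"]               # illustrative list; elements start uppercase

"dutch".lower() in languages                             # False: "dutch" never matches "Dutch"

# Possible fix: lowercase both sides before comparing
"dutch".lower() in [lang.lower() for lang in languages]  # True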

Model load time

transform() and fit_transform() take the same amount of time to produce results. If I train the model, save it, and load it again, it still takes the same time to give predictions. How can I quickly get predictions once I have trained and saved a model?

Retrieve the docs in a cluster

Hi Maarten,

First of all, thank you for this great tool and insight!

Am I missing something in the docs? How can I retrieve the indices of the docs belonging to a cluster number? If this is not implemented yet, is there a quick workaround for how I could do this?
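
A simple workaround, assuming the topics returned by fit_transform (or the .topics_ attribute) are aligned one-to-one with the input docs, is to index the document list directly:

topics, probs = topic_model.fit_transform(docs)

cluster_number = 12                                        # the topic/cluster of interest
doc_indices = [i for i, topic in enumerate(topics) if topic == cluster_number]
docs_in_cluster = [docs[i] for i in doc_indices]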

The option nr_topics seems useless

Hello! This work is remarkable!
I got a problem when I trained a topic model using Chinese text data and my own sentence embeddings:

ๆ•่Žท

The info given by the program suggests that the number of topics had been reduced to 30, but when I accessed the results using get_topics(), I found there were still 93 topics, why this happened?

By the way, I sometimes came across Memory Out of Limit Error when running this package on my data, I think the reason is that I have millions of texts. Do you have any suggestions on how to apply this package to millions of texts?

model.transform does not return probabilities for new documents

rl_bertopic_model = BERTopic(language="english")
rl_bertopic_model = rl_bertopic_model.load(f'models/{model_name}')

new_doc = [r"some text"]

new_doc_topics, new_doc_probabilities = rl_bertopic_model.transform(new_doc[0])

new_doc_probabilities is None on 0.5.0 but it works fine on 0.4.3. This is the case regardless of whether the model is trained fresh or loaded from file.

I'm assuming this is related to the low_memory option introduced in 0.5.0. The wording here seems backwards: "If low_memory in BERTopic is set to False, then the probabilities are not calculated to speed up computation and decrease memory usage." - is this how it works? Seems like it should be the other way around.

Thanks

Loading datasets

I'm very excited to see that there is now an LDAvis alternative that works with embeddings! Your documentation and Colab illustrate how to load the Newsgroups dataset. But could you also add something (for us less experienced users) about how to load local JSON files, using the 'abstract' field for the NLP but keeping the metadata for other types of analyses, like dynamic topic modelling (for examples of visualizations, see the also very cool DETM tool).

Does GPU help?

Hi, firstly thank you so much for this library. I've tried it and it does take some time to get the topics.
Just wondering, will having GPU help speed-wise? Is the speed bottle-necked at the sentence transformers embedding portion?

Make the algorithm less memory intensive

When using big data, it becomes infeasible to hold everything in memory at once.
Would it be possible to iterate over the data rather than hold it in memory?

It might also help exposing n_jobs parameter for UMAP so that the user has some control over the number of cores and therefore consumed memory.

Finetune on arxiv dataset

Hi,

thanks for your amazing work!
However, I currently still have some problems on getting good results.
I want to use BERTopic on the kaggle arxiv abstract dataset https://www.kaggle.com/Cornell-University/arxiv
It is a dataset that contains the abstract of each paper on arXiv. In total 1,796,908 abstracts, but I am using only 1/4 of them due to hardware constraints, so 449,227 abstracts. The raw data is a list of dicts, with each dict containing fields like author, title, abstract, etc., but I am only using the abstracts themselves.
My current results are sadly not what I expected. Here is the output of model.get_topics():

##################################
[('withdrawn', 0.12732245199899253), ('arxiv', 0.060818479638394804), ('author', 0.045282397053936205), ('been', 0.043582983757148634), ('paper', 0.04331377340066525), ('authors', 0.03602908119595011), ('has', 0.03413129955351502), ('discussion', 0.020046558277271205), ('version', 0.017570171724893863), ('error', 0.016245558058635576), ('due', 0.016034569088373845), ('4002', 0.015203603208166275), ('article', 0.015178787241213468), ('mcshane', 0.014512825764984364), ('1104', 0.013893798421411663), ('crucial', 0.012724570309587551), ('wyner', 0.011639183974558176), ('proxies', 0.011545341114998098), ('please', 0.011257392365683372), ('0804', 0.010829445454730597)]
##################################
[('withdrawn', 1.2378088374383105), ('been', 0.33161619452791685), ('paper', 0.2815599045047751), ('has', 0.2598521946696877), ('administratively', 0.037473809176819514), ('article', 0.035331755008345955), ('retracted', 0.032019088876856915), ('abstract', 0.03194552517951581), ('withdraw', 0.03023297555105207), ('submission', 0.025366426781504862), ('mistake', 0.024461766176310584), ('rewriting', 0.02108034213126981), ('want', 0.019899921380598113), ('this', 0.018769808691909386), ('shorter', 0.01690150874372038), ('comment', 0.01634104139337519), ('probably', 0.016086126678481083), ('applicable', 0.015210457362549933), ('modification', 0.014865572450063681), ('longer', 0.014582146616905768)]
##################################
[('isotopes', 0.2558417644790476), ('thirty', 0.22683223596981578), ('refereed', 0.19469496374394987), ('publication', 0.1454113600287234), ('isotope', 0.14061126024476908), ('brief', 0.11791781235255983), ('identification', 0.10522952641115933), ('discovery', 0.09816511283302375), ('summary', 0.08730514775501705), ('synopsis', 0.07636960227187568), ('production', 0.07302537404971297), ('discussed', 0.06821321035793458), ('including', 0.06506837933191251), ('presented', 0.06352529575007038), ('twenty', 0.057115384937416365), ('eight', 0.05686672933874315), ('each', 0.054793906411334324), ('far', 0.05099118599417039), ('minerals', 0.04545668048089448), ('observed', 0.04482361175054437)]
##################################
[('withdrawn', 0.8102220016577751), ('author', 0.5882810125654714), ('been', 0.21955498849750296), ('paper', 0.1935837075655028), ('has', 0.17516982841654938), ('pourmohammad', 0.08015698196117896), ('ali', 0.0605915947645577), ('seemann', 0.027628108579230422), ('eqn', 0.02270159607399226), ('admin', 0.022251661779234076), ('request', 0.01530721857357427), ('by', 0.013408498518361402), ('this', 0.012868228621997208), ('modification', 0.010599219818655269), ('authors', 0.010375596329419137), ('arxiv', 0.010147235836885003), ('km', 0.008868127574553979), ('due', 0.0053505630213037635), ('first', 0.004174426091124793), ('at', 0.0013290983743134937)]
##################################
[('de', 0.16471413053677894), ('la', 0.08859824729535025), ('un', 0.07960098252808794), ('en', 0.07656758724369946), ('des', 0.07493017494049987), ('une', 0.0685045487905329), ('est', 0.06506619811186878), ('nous', 0.0552461357202294), ('que', 0.0505970341092853), ('dans', 0.04833955653453861), ('pour', 0.04773108415278024), ('les', 0.04405246081293785), ('et', 0.04269515259835229), ('sur', 0.0425329786858753), ('caract', 0.034373522683066204), ('le', 0.03028301508609669), ('es', 0.029084319074609982), ('ees', 0.028840836535619835), ('cette', 0.023815804070613532), ('eme', 0.023083284220080345)]
##################################
[('model', 0.0029213211816859273), ('two', 0.0029181910122442487), ('it', 0.002917764978256985), ('can', 0.002911863114896897), ('these', 0.002900525114119986), ('our', 0.0028719993646575373), ('show', 0.0028703487897916566), ('results', 0.002862179058792491), ('also', 0.0028543448650093332), ('field', 0.002807120623162036), ('have', 0.0027961151449595436), ('using', 0.002780531524136966), ('between', 0.0027687481202621554), ('or', 0.002762760512175864), ('one', 0.0027467154286522437), ('time', 0.002741766841704294), ('energy', 0.0027274038973420667), ('data', 0.0026880146639130568), ('quantum', 0.0026615769324125527), ('such', 0.002660012066337066)]
##################################
[('withdrawn', 0.25382672910326465), ('arxiv', 0.1859321307804051), ('author', 0.10424878083447243), ('been', 0.09053395420662566), ('paper', 0.0810609839954744), ('has', 0.06435182608618999), ('version', 0.05760067485755415), ('authors', 0.05258540866955608), ('superseded', 0.04795847515378754), ('replaced', 0.043281616461112844), ('merged', 0.03743469139047417), ('0804', 0.03698167270671404), ('1008', 0.03085884218486115), ('because', 0.030835129343355642), ('0812', 0.023350724849799137), ('0901', 0.022639659892192746), ('revised', 0.02187828357506638), ('1306v6', 0.021542196549539806), ('submission', 0.02115347128457661), ('3484', 0.020833341465434623)]
##################################
[('withdrawn', 0.3174814989979916), ('author', 0.11776777433239484), ('been', 0.08872075679275165), ('paper', 0.08199984676632026), ('due', 0.08048541635175363), ('has', 0.07026260094965996), ('error', 0.05649266661391457), ('authors', 0.05390187230801506), ('arxiv', 0.051928726372487306), ('because', 0.034956486686744344), ('mistake', 0.032456919238108894), ('crucial', 0.0322646808282665), ('submission', 0.029450103971990518), ('administrators', 0.02776402639935557), ('admin', 0.024154968037124056), ('proof', 0.02232748069916814), ('errors', 0.017136278392237612), ('lemma', 0.015641869372412024), ('copyright', 0.015397080186955915), ('theorem', 0.014757663158028029)]

As you can see, the extracted topics are kind of bad and not what I have hoped for.
Can you give me some advice why this is not working and what I should finetune?

Best
Karol

`check_documents_type(documents)` in `utils.py` to support non-string portions of documents

Looking at the piece of code below in utils.py:

def check_documents_type(documents):
    """ Check whether the input documents are indeed a list of strings """
    if isinstance(documents, Iterable) and not isinstance(documents, str):
        if not any([isinstance(doc, str) for doc in documents]):
            raise TypeError("Make sure that the iterable only contains strings.")

    else:
        raise TypeError("Make sure that the documents variable is an iterable containing strings only.")

There are a lot of cases where the majority of a document is <class 'str'> and yet an exception will be raised here.

Better support for such cases would be beneficial: for instance, wrapping a single document with isinstance(documents, str) == True into an iterable, or allowing an option that decides what to do with numbers/dates/etc. within the document text.

There is also the case of double quotation marks within a text (to show quotes from someone) that breaks the code, resulting in a TypeError.

Such a solution may return a modified version of the documents as the outcome of the alteration.
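
For illustration, a more permissive check along the lines proposed above could look like this (a sketch, not the library's implementation):

from collections.abc import Iterable

def check_documents_type_permissive(documents):
    """ Accept a single string and coerce non-string items instead of raising. """
    if isinstance(documents, str):
        documents = [documents]                     # wrap a lone string into a list
    if not isinstance(documents, Iterable):
        raise TypeError("Make sure that the documents variable is an iterable.")
    return [str(doc) for doc in documents]          # coerce numbers/dates/etc. to strings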

No attribute 'self.verbose'

On running new_topics, new_probabilities = model.transform(new_doc)

c:\tools\anaconda3\envs\autotag\lib\site-packages\bertopic\_bertopic.py in transform(self, documents, embeddings)
    349         if not isinstance(embeddings, np.ndarray):
    350             self.embedding_model = self._select_embedding_model()
--> 351             embeddings = self._extract_embeddings(documents, verbose=self.verbose)
    352 
    353         umap_embeddings = self.umap_model.transform(embeddings)

AttributeError: 'BERTopic' object has no attribute 'verbose'

Seems to be introduced in 0.5 - the issue wasn't present on 0.4.3.

Data Input (vs. LDA & NMF)

Hey, this is an awesome project. My question is: how exactly do you pre-process long texts? I notice that the data you demo with are all short texts (most of them one sentence each). I tried to imitate that by segmenting my texts into sentences (splitting long texts on periods and semicolons and validating their lengths) while removing all punctuation, but I am getting only 1 cluster. Any suggestion would be much appreciated. I understand how transformer models are different from topic models like LDA and NMF, but do you think it's possible for BERT and transformer models to do something similar, i.e. to take several long text files as input and simply generate models without a limitation on text length? Thank you.

Inconsistency between topic with maximum probability and the predicted one for a document

Hi Maarten,

When using BERTopic on fetch_20newsgroups dataset to extract topics and their associated representative documents I figured out that for a given document the predicted topic was different from the one with the maximum probability. Of course, I checked it for topic label different from -1. In other words, it seems to have an inconsistency between predicted topics and probabilities. Is this normal ?

When we use the following:

topic_model = BERTopic(language="english", calculate_probabilities=True)
preds, probs = topic_model.fit_transform(docs)

For each index idx, shouldn't we have preds[idx] == numpy.argmax(probs[idx, :])?

Thank you in advance for your response.

custom dataset instructions

Hi,

Hope you are all well !

I wanted to apply BERTopic to a custom dataset, but can you provide more details about the input format for training a custom model ?

Thanks for any insights or inputs on that question.

Cheers,
X

Predict multiple topics per document

Would it be feasible to return the probabilities for all of the topics rather than only returning the best topic? This would be similar to LDA, where typically proportions or probabilities are returned for all topics.

I think this could be done by changing transform() to use membership_vector() instead of approximate_predict().

reduce_topics() changes probabilities in place

When we use the method reduce_topics(), it mutates the given probabilities parameter, which then becomes identical to the returned probabilities. It would be better if it did not mutate the given array, so that there are two different probability arrays: one for before and one for after.

Hyperparameter tuning

Hi,
Nice work on the package. I had a question.
The model is quite sensitive to its parameters. I was trying to prepare a pipeline to automatically find the best parameters. I am using the outlier count as my metric: the lower the number of outliers, the better the model.
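
Roughly, the pipeline looks like this (a sketch; the parameter grid is just an example):

from bertopic import BERTopic

best_params, best_outliers = None, float("inf")
for min_topic_size in [10, 20, 50]:
    for n_gram_range in [(1, 1), (1, 2)]:
        model = BERTopic(min_topic_size=min_topic_size, n_gram_range=n_gram_range)
        topics, _ = model.fit_transform(docs)
        outliers = sum(1 for t in topics if t == -1)   # metric: fewer outliers is better
        if outliers < best_outliers:
            best_params, best_outliers = (min_topic_size, n_gram_range), outliers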

I want to understand: is this the right approach?

Thanks!

Applying the code to my own dataset

Hi, thank you for this great work. I'm a beginner with BERT, and I want to use your code to extract topics from Arabic text (stored in MongoDB). Do you have an idea how I can do this? Thank you so much.

BR

Proper exception handling for documents with no topics

I have come across a few cases in my corpus where probabilities[i] returns no probabilities that equal or exceed min_probability, and thus visualize_distribution will throw an exception on vals = probabilities[labels_idx].tolist().

Better exception handling for these cases, e.g. showing an informative alert instead of breaking the code, would be very handy.

Making embedding computation more scalable

Hello!

2020-10-31 14:35:53,446 - BERTopic - Loaded BERT model
INFO:BERTopic:Loaded BERT model
2020-10-31 15:29:37,627 - BERTopic - Transformed documents to Embeddings

It currently takes about an hour to compute embeddings for 20,000 documents in the 20 Newsgroups loaded with:

docs = fetch_20newsgroups(subset='all')['data']

To scale this better, one way is to use bert-as-service with multiple workers. Have you thought about making the embedding computation pluggable?
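
For context, fit_transform also accepts precomputed embeddings (as used in another issue above), so the embedding step can already be computed elsewhere and plugged in. A sketch, where the model name is just an example:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset='all')['data']

# Compute embeddings outside of BERTopic, e.g. on a GPU machine or a separate service
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, batch_size=64, show_progress_bar=True)

# Pass the precomputed embeddings so BERTopic skips its own embedding step
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)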

inconsistency about outlier class in reduce_topics()

Hey, thanks for this great work.

reduce_topics() mixes up the topic with the biggest ID and the outlier class, which has class ID -1. That happens because of the _map_probabilities() method: when the outlier topic is determined for the from_topic or to_topic variables, the method modifies the last element of the probabilities array, which is not the outlier class's probability, since the outlier class's probability does not exist in the probabilities array.

More progressbars

Hi,

I see that there is an option for a tqdm progress bar in _extract_embeddings but no actual access to it. As the dataset I am working on is quite big, I would like to have an estimate of how long things are going to take.
Would it be possible to enable the progress bar from fit_transform?
I don't know if it is possible, but could you add progress bars for the other steps as well?

Best
Karol

huggingface/tokenizers: The current process just got forked, after parallelism has already been used

I'm having a strange warning during the function fit_transform

from bertopic import BERTopic

model_berttopic = BERTopic(language="english", verbose=True, stop_words="english")
topics, probabilities = model_berttopic.fit_transform(documents)
print(topics)
print(probabilities)
model_berttopic.save("bertopic_model")

and the output is

2021-01-27 11:31:07,461 - BERTopic - Loaded embedding model
2021-01-27 20:53:53,191 - BERTopic - Transformed documents to Embeddings
2021-01-27 20:57:27,904 - BERTopic - Reduced dimensionality with UMAP
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

The program is still running after hours without further logs (I expected 'Clustered UMAP embeddings with HDBSCAN', 'Loaded embedding model', 'Transformed documents to Embeddings', and the save of the model). What is happening? There is no feedback on what it is doing.

embedding_model bug

Thanks for providing an easy-to-use library. When setting the embedding_model parameter during BERTopic initialization, it isn't loading the model I want but defaults to 'distilbert-base-nli-stsb-mean-tokens'. I think this is because of the elif clause of the _select_embedding_model function in _bertopic.py:

def _select_embedding_model(self) -> SentenceTransformer:

self.language is referenced before self.embedding_model, and since the default language value is 'english', it returns the transformer model under the self.language clause regardless of which embedding model I choose.

ModuleNotFoundError when pip installing bertopic in venv

I am getting ModuleNotFoundError: No module named 'bertopic' while the output of pip install bertopic is as follows:

Requirement already satisfied: bertopic in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (0.3.4)
Requirement already satisfied: matplotlib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (3.3.3)
Requirement already satisfied: pandas in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.1.5)
Requirement already satisfied: scikit-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.23.2)
Requirement already satisfied: tqdm in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (4.54.1)
Requirement already satisfied: hdbscan in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.8.26)
Requirement already satisfied: numpy in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.19.4)
Requirement already satisfied: sentence-transformers in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.3.9)
Requirement already satisfied: joblib in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.0.0)
Requirement already satisfied: umap-learn in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (0.4.6)
Requirement already satisfied: torch in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from bertopic) (1.7.1)
Requirement already satisfied: python-dateutil>=2.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (2.4.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from matplotlib->bertopic) (8.0.1)
Requirement already satisfied: pytz>=2017.2 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from pandas->bertopic) (2020.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from scikit-learn->bertopic) (1.5.4)
Requirement already satisfied: six in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (1.14.0)
Requirement already satisfied: cython>=0.27 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from hdbscan->bertopic) (0.29.21)
Requirement already satisfied: nltk in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5)
Requirement already satisfied: transformers<3.6.0,>=3.1.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from sentence-transformers->bertopic) (3.5.1)
Requirement already satisfied: numba!=0.47,>=0.46 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from umap-learn->bertopic) (0.52.0)
Requirement already satisfied: typing-extensions in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from torch->bertopic) (3.7.4.3)
Requirement already satisfied: click in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (7.1.2)
Requirement already satisfied: regex in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from nltk->sentence-transformers->bertopic) (2020.11.13)
Requirement already satisfied: requests in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (2.22.0)
Requirement already satisfied: protobuf in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.14.0)
Requirement already satisfied: sacremoses in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.0.43)
Requirement already satisfied: packaging in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (20.3)
Requirement already satisfied: sentencepiece==0.1.91 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.1.91)
Requirement already satisfied: tokenizers==0.9.3 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (0.9.3)
Requirement already satisfied: filelock in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from transformers<3.6.0,>=3.1.0->sentence-transformers->bertopic) (3.0.12)
Requirement already satisfied: llvmlite<0.36,>=0.35.0 in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (0.35.0)
Requirement already satisfied: setuptools in /home/arman/git/Scraping/venv/lib/python3.8/site-packages (from numba!=0.47,>=0.46->umap-learn->bertopic) (44.0.0)

Installation Error

I am running OS X 10.11.6.

$ rustc --version
rustc 1.46.0
$ cargo --version
cargo 1.45.1
...
error[E0554]: `#![feature]` may not be used on the stable release channel
    --> /Users/davidlaxer/.cargo/registry/src/github.com-1ecc6299db9ec823/lock_api-0.3.4/src/lib.rs:91:34
     |
  91 | #![cfg_attr(feature = "nightly", feature(const_fn))]
     |                                  ^^^^^^^^^^^^^^^^^
  

ValueError: k must be less than or equal to the number of training points

Hi Maarten,

I'm trying to get a topic on just a list of words: coffee, alcohol, drunk, cigarettes, smoking, drugs. So that I can have a topic called "Addiction" for example.

This is my code

from bertopic import BERTopic

docs = ['[CLS]', '[UNK]', 'coffee', 'alcohol', '[UNK]', 'drunk', 'cigarettes', 'smoking', 'drugs', '[SEP]']

model = BERTopic(verbose=True)
topics = model.fit_transform(docs)

And this is the error that I'm getting:

2021-02-08 23:08:18,794 - BERTopic - Loaded embedding model
2021-02-08 23:08:18,856 - BERTopic - Transformed documents to Embeddings
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/umap/umap_.py:2214: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
2021-02-08 23:08:21,252 - BERTopic - Reduced dimensionality with UMAP
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 278, in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bertopic/_bertopic.py", line 753, in _cluster_embeddings
    self.cluster_model = hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 922, in fit
    self.generate_prediction_data()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 961, in generate_prediction_data
    self._prediction_data = PredictionData(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdbscan/prediction.py", line 104, in __init__
    self.core_distances = self.tree.query(data, k=min_samples)[0][:, -1]
  File "sklearn/neighbors/_binary_tree.pxi", line 1342, in sklearn.neighbors._kd_tree.BinaryTree.query
ValueError: k must be less than or equal to the number of training points

Can it be something in the parameter settings? I can't figure it out, any help is very appreciated.

Allow passing all keyword arguments of CountVectorizer to BERTopic constructor

Hi
I think it could be great if we could pass all existing keyword arguments of CountVectorizer to BERTopic, and not only n_gram_range and stop_words as is the case today.
Some of them, like max_df, min_df, strip_accents or even tokenizer, can be of great help when fine-tuning a model.

It could be done by changing the signature of the __init__ method from

    def __init__(self,
                 bert_model: str = 'distilbert-base-nli-mean-tokens',
                 top_n_words: int = 20,
                 nr_topics: int = None,
                 n_gram_range: Tuple[int, int] = (1, 1),
                 min_topic_size: int = 30,
                 n_neighbors: int = 15,
                 n_components: int = 5,
                 stop_words: Union[str, List[str]] = None,
                 verbose: bool = False)

to

    def __init__(self,
                 bert_model: str = 'distilbert-base-nli-mean-tokens',
                 top_n_words: int = 20,
                 nr_topics: int = None,
                 min_topic_size: int = 30,
                 n_neighbors: int = 15,
                 n_components: int = 5,
                 verbose: bool = False,
                 **kwargs)

Then storing the kwargs dictionary as a class attribute self.kwargs
And then in _c_tf_idf

count = CountVectorizer(**self.kwargs).fit(documents)

I can even provide a PR if you want

Thx again for this great package

Olivier Terrier
@kairntech

Expose batch_size parameter

It would be nice to have control over the batch size used when converting the docs to embeddings.
The default is 32, and during my run of the algorithm the GPU memory never exceeded 2.5/16 GB. A larger batch size could improve the speed of the embedding extraction.

Use SentenceTransformer through Flair

Hi Maarten
Thanks again for this great package; the 0.5 release is just amazing.
Regarding Flair vs SentenceTransformer: maybe it could be interesting to always go through Flair, even for SentenceTransformer models.

Flair has a top-level class DocumentEmbeddings and several implementations, among them:

TransformerDocumentEmbeddings
SentenceTransformerDocumentEmbeddings
DocumentTFIDFEmbeddings
DocumentPoolEmbeddings

What do you think?

Best regards

Olivier

Error when install on windows

Hi everyone,
When I try to install bertopic on Windows (pip install bertopic), I get an error. The problem arises on this line: "Building wheels for collected packages: hdbscan".
The next line is: Building wheel for hdbscan (PEP 517) ... error
...
ERROR: Failed building wheel for hdbscan
Failed to build hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly.

I tried different Python versions (3.5, 3.6, 3.7) but nothing changed.
Did someone have the same problem and solve it?
Thank you all,

Andrea.

Issue when using n_gram_range other than (1,1)

Hi, really nice work with this package, it's very useful.

Model initiation takes the argument n_gram_range, but I think that it doesn't get used. Should line 241, referenced here, be
count = CountVectorizer(ngram_range=n_gram_range, stop_words="english").fit(documents)?

count = CountVectorizer(stop_words="english").fit(documents)

It might be nice to have the stop_words argument be configurable at initiation as well, so that the user could pass a corpus-specific set of stop words.

Plot customization

Hi Maarten!

Firstly, congratulations for your work.

I would like to suggest that the BERTopic _plotly_topic_visualization function used by visualize_topics() not only shows the Plotly figure but also returns the figure as a variable. This would be useful because, with the figure in a variable, the user can download it as HTML, PDF, etc. In my case, I would like to embed the figure in a dashboard.

Besides, I would suggest allowing the user to access the parameters used during topic modeling (i.e., UMAP, HDBSCAN, Plotly visualization).

Text preprocessing

Hi! Thanks for developing this awesome library!

I have a question regarding text preprocessing.

From what I understand, the model takes List[str] as an input - basically a list of fulltext documents.

But do we need to preprocess texts somehow before passing it into the model?

With LDA, I usually preprocess texts (tokenize, lemmatize, remove stopwords, create n-grams, etc.) before running models. But since we're dealing with word embeddings, keeping all words in their original form is important for the context, right?

So I'm not sure how to proceed, should I use list of preprocessed words as an input, or leave texts untouched, or something in between (keeping text as a string but without stopwords, etc.)?

GPU utility issue

Hello! When I tried to train the model using my local GPU, I noticed that even though the code takes up some GPU memory, GPU utilization stays at 0. Could you please give me some hints to solve this issue?
