maximtrp / bitermplus

Biterm Topic Model (BTM): modeling topics in short texts

Home Page: https://bitermplus.readthedocs.io/en/stable/

License: MIT License

Python 31.41% Cython 68.59%

topic-modeling nlp nlp-machine-learning natural-language-processing topic-models cython biterm-topic-model btm python visualization machine-learning data-science

bitermplus's Issues

How do I get the topic words?

Hi,

Firstly, thanks for sharing your code.

Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words? I need at least the three most relevant terms.

Thanks in advance.
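
A minimal sketch of one way to pull the top words out of a topics-vs-words matrix, assuming rows are topics and columns follow the order of `vocabulary` (the T x W layout described elsewhere on this page for `model.matrix_topics_words_`). The helper `top_words_per_topic` and the toy data are illustrative and not part of bitermplus; the library's own `get_top_topic_words`, mentioned in a later issue, may already cover this.

```python
def top_words_per_topic(matrix_topics_words, vocabulary, n=3):
    """Return the n highest-probability words for every topic (row)."""
    result = []
    for row in matrix_topics_words:
        # Indices of the words sorted by descending probability
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Toy T x W matrix (2 topics, 4 words); values are made up for illustration
p_wz = [
    [0.05, 0.40, 0.30, 0.25],
    [0.50, 0.10, 0.15, 0.25],
]
vocab = ["price", "screen", "sound", "remote"]

top3 = top_words_per_topic(p_wz, vocab, n=3)
print(top3)  # [['screen', 'sound', 'remote'], ['price', 'remote', 'sound']]
```

Because the helper only iterates over rows, it should also accept a NumPy array in place of the nested lists.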

Cannot find Closest topics and Stable topics

Hello there,
I am able to generate the model and visualize it, but when I try to find the closest topics and stable topics, I get an error on this line:

closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)

The error is:

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

This happens even though I checked the array separately and it is 2-D. I am pasting the code below. Could you please check whether I am doing anything wrong?

Thank you.

X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))

# Vectorizing documents
docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)

# Generating biterms
Y = X.todense()
biterms = btm.get_biterms(docs_vec, 15)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
model.fit(biterms, iterations=500, verbose=True)
p_zd = model.transform(docs_vec,verbose=True)  
print(p_zd) 

# matrix of document-topics; topics vs. documents, topics vs. words probabilities 
matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.

# Getting stable topics
print("Array Dimension = ",len(matrix_topic_words.shape))
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)

# Stable topics indices list
print(stable_topics)
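
A possible explanation, assuming `get_closest_topics` expects several topics-vs-words matrices as separate positional arguments (one per trained model): the star in `*matrix_topic_words` unpacks a single matrix into its rows, so every argument arrives as a 1-D array. The sketch below uses a hypothetical `describe` function and toy lists only to show what star-unpacking does; it does not call bitermplus.

```python
def describe(*matrices):
    """Report, for each positional argument, whether it is still 2-D."""
    return ["2-D" if isinstance(m[0], (list, tuple)) else "1-D" for m in matrices]


# One toy topics-vs-words matrix: 2 topics x 3 words
topics_words = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

# Star-unpacking a single matrix passes its rows one by one
unpacked = describe(*topics_words)
print(unpacked)  # ['1-D', '1-D']

# Passing whole matrices (for example, one per model) keeps them 2-D
whole = describe(topics_words, topics_words)
print(whole)  # ['2-D', '2-D']
```

If that assumption holds, fitting several models (for instance with different seeds) and passing `*[m.matrix_topics_words_ for m in models]` would give the function the 2-D inputs it indexes.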

failed building wheels

Hi!

I got an error when running pip3 install bitermplus on macOS (Intel-based, Ventura), using Python 3.10.8 in a separate venv (not Anaconda):

Building wheels for collected packages: bitermplus
  Building wheel for bitermplus (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [34 lines of output]
      Error in sitecustomize; set PYTHONVERBOSE for traceback:
      AssertionError:
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-12-x86_64-cpython-310
      creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running egg_info
      writing src/bitermplus.egg-info/PKG-INFO
      writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
      writing requirements to src/bitermplus.egg-info/requires.txt
      writing top-level names to src/bitermplus.egg-info/top_level.txt
      reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.macosx-12-x86_64-cpython-310
      creating build/temp.macosx-12-x86_64-cpython-310/src
      creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/python@3.10/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
      src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
      #include <omp.h>
               ^~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Could this error be related to #29? I've tested on a PC and it worked though.

remove STOPWORDS

Hello,

I was wondering if it is possible to remove stop words from the word frequency counts call - this one:
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)

Thanks for your help,
Chitra
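
Another issue on this page calls `btm.get_words_freqs(texts, stop_words=stop_words)`, which suggests the function accepts a stop-word list directly. As an alternative, the texts can be filtered before they reach the library. The sketch below is a hypothetical pre-processing helper that assumes whitespace tokenization and case-insensitive matching; it is not part of bitermplus.

```python
def remove_stop_words(texts, stop_words):
    """Drop stop words from whitespace-tokenized texts (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    return [
        " ".join(tok for tok in text.split() if tok.lower() not in stop)
        for text in texts
    ]


cleaned = remove_stop_words(
    ["The cat sat on the mat", "A dog in the yard"],
    stop_words=["the", "on", "a", "in"],
)
print(cleaned)  # ['cat sat mat', 'dog yard']
```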

Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code raises the error in the title. Something seems to go wrong in the get_words_freqs function. I would appreciate your advice on how to fix it.
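
This message usually means the installed scikit-learn predates version 1.0, where `get_feature_names_out` was introduced as the replacement for `get_feature_names`, so upgrading scikit-learn is the likely fix. The sketch below only illustrates the difference between the two method names with a fallback helper; `OldVectorizer` and `NewVectorizer` are hypothetical stand-ins, not scikit-learn classes.

```python
def feature_names(vectorizer):
    """Return feature names using whichever method the vectorizer provides."""
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older API name, used before get_feature_names_out existed
        getter = vectorizer.get_feature_names
    return list(getter())


class OldVectorizer:
    """Stand-in for a vectorizer that only has the old method."""

    def get_feature_names(self):
        return ["apple", "banana"]


class NewVectorizer:
    """Stand-in for a vectorizer that has the new method."""

    def get_feature_names_out(self):
        return ["apple", "banana"]


old_names = feature_names(OldVectorizer())
new_names = feature_names(NewVectorizer())
print(old_names == new_names)  # True
```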

ValueError: Invalid shape in axis 0: 0.

This error happens only some of the time, and it doesn't seem to be related to docs_vec[i]. The position at which the error is reported also changes, for example 0/244232 or 856/244232.

100%|██████████| 20/20 [00:23<00:00,  1.13s/it]
  0%|          | 0/244232 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "F:/code/depression_tieba/topic model/biterm22.py", line 34, in <module>
    p_zd = model.transform(docs_vec)
  File "src\bitermplus\_btm.pyx", line 446, in bitermplus._btm.BTM.transform
  File "src\bitermplus\_btm.pyx", line 481, in bitermplus._btm.BTM.transform
  File "src\bitermplus\_btm.pyx", line 331, in bitermplus._btm.BTM._infer_doc
  File "src\bitermplus\_btm.pyx", line 365, in bitermplus._btm.BTM._infer_doc_sum_b
  File "stringsource", line 153, in View.MemoryView.array.__cinit__
ValueError: Invalid shape in axis 0: 0.

docs_vec[0]
[ 86027 26789 50200 7758 66912 79522 40559 65192 34724 75526, 93343 50346 44309 60165 46216 102898 21657 42681]

the topic distribution for all doc is similar

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]

Failed building wheel for bitermplus

Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

When I try to install bitermplus with pip install bitermplus, I get an error message like this:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Got an unexpected result in marked sample

Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, when I train the model on labeled samples, I get unexpected results. First, the labeled samples contain 5 classes, but the trained model has a huge perplexity when the number of topics is 5. Second, when I vary the number of topics from 1 to 20, the perplexity decreases as the number of topics increases. My code is as follows:
df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
print(df)
stop_words = segmentWord.stopwordslist()
perplexitys = []
coherences = []

for T in range(1,21,1):
    print(T)
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Generating biterms
    biterms = btm.get_biterms(docs_vec)
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
    model.fit(biterms, iterations=2000)
    p_zd = model.transform(docs_vec)
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
    coherence = model.coherence_
    perplexitys.append(perplexity)
    coherences.append(coherence)

The vocabularies input into BTM

Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, I find that the vocabulary passed to BTM filters out single-character words. For English, this is a good way to drop meaningless tokens, but in other languages, such as Chinese, some single-character words carry meaning. Do you provide an option to turn this filtering off? I would appreciate your advice.
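
One likely source of this filtering is scikit-learn's `CountVectorizer`, whose default `token_pattern` only matches tokens of two or more characters. The sketch below compares that default with a pattern that also keeps one-character tokens, using plain `re` on a pre-segmented sentence. Whether `get_words_freqs` forwards a `token_pattern` keyword to the vectorizer is an assumption here, suggested by the `max_df`, `min_df` and `ngram_range` arguments passed to it in another issue on this page.

```python
import re

# CountVectorizer's default pattern: tokens of two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps one-character tokens
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

# Pre-segmented Chinese sentence: "I like watching TV"
text = "我 喜欢 看 电视"

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)  # ['喜欢', '电视']
print(all_tokens)      # ['我', '喜欢', '看', '电视']
```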

Using `biterm.perplexity()` for Calculating Perplexity of Other Topic Models

From my understanding, biterm.perplexity() takes in three inputs: p_wz, the topics vs. words probabilities matrix (T x W); p_zd, the documents vs. topics probabilities matrix (D x T); and T, the number of topics. Other topic models often produce these same matrices as output.

May I ask if it is possible to use biterm.perplexity() to calculate the perplexity by Heinrich (2005) of other topic models?

Thank you!
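
For reference, the standard definition of perplexity only needs the two probability matrices and the word counts per document: perplexity = exp(-sum over d, w of n_dw * log p(w|d) / total tokens), with p(w|d) = sum over z of p(w|z) * p(z|d). The sketch below is a plain-Python version of that definition and is not bitermplus's implementation; the function name and toy data are illustrative. Note that another issue on this page calls `btm.perplexity` with four arguments, including the count matrix `X`.

```python
import math


def perplexity(p_wz, p_zd, doc_word_counts):
    """Perplexity from topics-vs-words (T x W) and docs-vs-topics (D x T) matrices."""
    log_likelihood = 0.0
    n_tokens = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n_dw in enumerate(counts):
            if n_dw == 0:
                continue
            # p(w | d) = sum_z p(w | z) * p(z | d)
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n_dw * math.log(p_wd)
            n_tokens += n_dw
    return math.exp(-log_likelihood / n_tokens)


# One topic that is uniform over 4 words: perplexity equals the vocabulary size
uniform_topic = [[0.25, 0.25, 0.25, 0.25]]
doc_topics = [[1.0], [1.0]]
counts = [[1, 2, 0, 1], [3, 0, 0, 1]]

value = perplexity(uniform_topic, doc_topics, counts)
print(round(value, 6))  # 4.0
```

Since the function takes nothing model-specific, the same calculation should apply to any topic model that exposes these two matrices.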

Calculating wrong perplexity?

@maximtrp

Hi! Thank you for sharing this repository. It is easy to use, and the combination with tmplot is especially nice.

I would like to ask about the implementation of the perplexity calculation. When I used this repo, I noticed that the scale is a little strange, so I tested the implementation as follows:

import bitermplus as btm
import numpy as np

# Parameter; choose '1' or '4' to run assertion
topics_num = 1

# Generate test data
word_list = [
    ["white", "black", "red", "green", "blue"],
    ["dog", "cat", "fish", "bird", "rabbit"],
    ["apple", "banana", "lemon", "orange", "melon"],
    ["Japan", "America", "China", "England", "France"],
]
documents = [
    " ".join(np.random.choice(word_list[topic], 100))
    for topic in range(len(word_list)) for i in range(100)
]
# Vectorizing documents, obtaining full vocabulary and biterms
X, vocabulary, _ = btm.get_words_freqs(documents)
docs_vec = btm.get_vectorized_docs(documents, vocabulary)
biterms = btm.get_biterms(docs_vec)

print("Modeling started")
model = btm.BTM(
    X,
    vocabulary,
    seed=12321,
    T=topics_num,
    M=20,
    alpha=50 / topics_num,
    beta=0.01,
)
model.fit(biterms, iterations=100)

print("Evaluating perplexity")
# Run this to assign self.p_zd
model.transform(docs_vec)
perplexity = model.perplexity_
if topics_num == 1:
    assert perplexity > 10, "Perplexity should be about 20 when the number of topic is one." 
elif topics_num == 4:
    assert perplexity < 10, "Perplexity should be about five when the number of topic is four."
    assert perplexity > 0, "Perplexity must be greater than zero."
    assert perplexity > 5, "Perplexity should be greater than five as each topic has five words."

This test fails: I got 7.620787225693601 when topics_num is 1 and 3.343172868550849 when topics_num is 4. I think both are too small.

I would be glad if you could share any insight into this behavior!

Is it possible to keep only those words that occur in at most 90% and at least 10% of documents when calling X, vocabulary, vocab_dict = btm.get_words_freqs()?
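
Another issue on this page passes `max_df=.85, min_df=15` to `get_words_freqs`, which suggests that CountVectorizer-style thresholds are accepted; in scikit-learn's `CountVectorizer`, float values are proportions of documents and integers are absolute counts, so `max_df=0.9, min_df=0.1` may be what is needed here. The sketch below shows what such thresholds mean on already tokenized documents. The helper `filter_by_doc_freq` is hypothetical and not part of bitermplus.

```python
def filter_by_doc_freq(tokenized_docs, min_df=0.1, max_df=0.9):
    """Keep words whose document frequency (as a proportion) is within bounds."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word for word, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df
    )


# 20 toy documents: "the" is in all of them, "rare" in only one
docs = [["the", "tv"]] * 10 + [["the", "washer"]] * 9 + [["the", "rare"]]

kept = filter_by_doc_freq(docs, min_df=0.1, max_df=0.9)
print(kept)  # ['tv', 'washer']
```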

Questions regarding Perplexity and Model Comparison with C++

I have two questions regarding this model. First, I noticed that perplexity is implemented as an evaluation metric. Traditionally, however, perplexity is computed on a held-out dataset. Does that mean that when using this model, we should set aside a certain proportion of the data and compute the perplexity on samples that were not used for training?
My second question: I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that, and which part is implemented differently?

How can I transform a new document using an already trained model?

import pickle as pkl

import bitermplus as btm
import pandas as pd

with open('model{}.pickle'.format(best_topic), 'rb') as f:
    best_model = pkl.load(f)

user_vec = btm.get_vectorized_docs(user_texts, vocabulary)  # len(user_texts) = 400
user_td = best_model.transform(user_vec)  # However, len(user_td) = 245081 (length of the training corpus)
user_topic = pd.DataFrame()
user_topic['portrait'] = user['portrait']  # len(user_topic['portrait']) = 400
user_topic['topic'] = list(user_td)  # ERROR: 245081 != 400

Installation errors with Mac OS

Hi,

I can't install the library on macOS with an Intel chip.

I'm in JupyterLab, with the latest version of Python 3 and the latest pip/wheel/setuptools. I have libomp installed via brew, and I have xcode-select installed.

Error is as follows:

Building wheels for collected packages: bitermplus
  Building wheel for bitermplus (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      /private/var/folders/p0/b89vzkwj6s59mbscpyll6dzm0000gn/T/pip-build-env-pxw75s_w/overlay/lib/python3.11/site-packages/setuptools/config/pyprojecttoml.py:66: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
        config = read_configuration(filepath, True, ignore_option_errors, dist)
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-10.9-universal2-cpython-311
      creating build/lib.macosx-10.9-universal2-cpython-311/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.macosx-10.9-universal2-cpython-311/bitermplus
      copying src/bitermplus/_util.py -> build/lib.macosx-10.9-universal2-cpython-311/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.macosx-10.9-universal2-cpython-311
      creating build/temp.macosx-10.9-universal2-cpython-311/src
      creating build/temp.macosx-10.9-universal2-cpython-311/src/bitermplus
      clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-311/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
      src/bitermplus/_btm.c:776:10: fatal error: 'omp.h' file not found
      #include <omp.h>
               ^~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Any help would be appreciated!
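
A possible reason the header is not found even with libomp installed: Homebrew keeps libomp outside the default compiler search paths, so clang may need explicit include and library flags. The sketch below looks for `omp.h` under the usual Homebrew prefixes and builds the matching `CPPFLAGS` and `LDFLAGS` values. The helper name is hypothetical, the prefixes are the common Homebrew defaults rather than anything taken from this log, and whether exporting these flags is enough for the bitermplus build is an assumption.

```python
import os

# Usual Homebrew locations for libomp (assumed defaults)
CANDIDATE_PREFIXES = (
    "/usr/local/opt/libomp",     # Homebrew on Intel Macs
    "/opt/homebrew/opt/libomp",  # Homebrew on Apple Silicon
)


def libomp_build_flags(prefixes=CANDIDATE_PREFIXES):
    """Return CPPFLAGS/LDFLAGS for the first prefix containing omp.h, else None."""
    for prefix in prefixes:
        if os.path.isfile(os.path.join(prefix, "include", "omp.h")):
            return {
                "CPPFLAGS": "-I" + os.path.join(prefix, "include"),
                "LDFLAGS": "-L" + os.path.join(prefix, "lib"),
            }
    return None


flags = libomp_build_flags()
if flags is None:
    print("omp.h not found under the usual Homebrew prefixes")
else:
    for name, value in flags.items():
        print(f'export {name}="{value}"')
```

The printed lines are meant to be run in the shell before retrying the pip install.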

Implementation Guide

I was wondering whether there is any way to print the topics generated by the BTM model, just as I can with Gensim. In addition, I am getting all negative coherence values, around -500 or -600. I am not sure if I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output.

The following image shows what is held by the variable adobe. Again, I am not sure whether it needs to be in this form or whether each row needs to be a list.
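
On the negative coherence values: if the reported coherence is a UMass-style score (an assumption about bitermplus, not something stated on this page), it is a sum of logarithms of co-occurrence ratios, so negative values are typical and their magnitude grows with the number of word pairs. The sketch below is a plain-Python illustration of such a score on toy data, not the library's implementation.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


docs = [
    ["tv", "screen", "sound"],
    ["tv", "screen"],
    ["washer", "drum"],
    ["tv", "sound"],
]

related = umass_coherence(["tv", "screen", "sound"], docs)
unrelated = umass_coherence(["tv", "drum"], docs)
print(related)    # 0.0
print(unrelated)  # negative: "tv" and "drum" never co-occur
```

Under this reading, a more negative value means the top words co-occur less often, and scores are mainly useful for comparing models on the same corpus.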

Is it possible to contain only those words that occur in minimum 90% of documents or

Linux Installation of pythonx.x-dev needed if installing in a virtual environment

Could we add
sudo apt-get install python3.x-dev
to the docs as an installation step to run before pip on Linux, when installing in a virtual environment that uses a different Python executable?

Otherwise I get this wheel error:

error: subprocess-exited-with-error

  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.linux-x86_64-cpython-38/bitermplus
      copying src/bitermplus/_util.py -> build/lib.linux-x86_64-cpython-38/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.linux-x86_64-cpython-38
      creating build/temp.linux-x86_64-cpython-38/src
      creating build/temp.linux-x86_64-cpython-38/src/bitermplus
      x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/home/******/.virtualenvs/tfcert/include -I/usr/include/python3.8 -c src/bitermplus/_btm.c -o build/temp.linux-x86_64-cpython-38/src/bitermplus/_btm.o -fopenmp
      src/bitermplus/_btm.c:35:10: fatal error: Python.h: No such file or directory
         35 | #include "Python.h"
            |          ^~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
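
The `Python.h` error above indicates that the C headers for the interpreter are missing. A quick check before running pip is to ask the interpreter where it expects its headers and see whether the file is there. The sketch below uses the standard-library `sysconfig` module; the helper name `has_python_headers` is illustrative.

```python
import os
import sysconfig


def has_python_headers(include_dir=None):
    """Return True if Python.h exists in the interpreter's include directory."""
    if include_dir is None:
        include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))


print("Include directory:", sysconfig.get_paths()["include"])
print("Python.h present:", has_python_headers())
```

If this prints False inside the virtual environment, installing the matching python3.x-dev package for that interpreter version is the usual remedy on Debian-based systems.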

How do I get the probabilities of each document for a given topic?

After running the model, how can I get each document's probability for a given topic so that I can assign it back to the dataframe? If that isn't possible, how can I get the documents associated with a topic, as shown in the tmp.report(..) plot?
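
Other issues on this page obtain a documents-vs-topics matrix with `p_zd = model.transform(docs_vec)`, described there as D x T. Assuming that layout, the probability of each document for one topic is simply one column of that matrix. The sketch below ranks documents for a topic; the helper name and the toy matrix are illustrative.

```python
def docs_for_topic(p_zd, topic, top_n=3):
    """Return (doc_index, probability) pairs ranked by P(topic | doc)."""
    column = [(d, row[topic]) for d, row in enumerate(p_zd)]
    return sorted(column, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy D x T matrix: 3 documents, 2 topics
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

ranked = docs_for_topic(p_zd, topic=0, top_n=2)
print(ranked)  # [(0, 0.9), (2, 0.6)]
```

If `p_zd` is a NumPy array with one row per dataframe row, assigning `p_zd[:, 0]` to a new column should attach the topic 0 probabilities to the dataframe.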

ValueError: Buffer dtype mismatch, expected 'long' but got 'long long'

OS: Windows

Error

Traceback (most recent call last):
  File "D:\ProgramData\Anaconda\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "D:\ProgramData\Anaconda\lib\unittest\case.py", line 615, in run
    testMethod()
  File "D:\PYworkspace\bitermplus-main\tests\test_btm.py", line 32, in test_btm_class
    model.fit(biterms, iterations=20)
  File "src\bitermplus\_btm.pyx", line 226, in bitermplus._btm.BTM.fit
  File "src\bitermplus\_btm.pyx", line 236, in bitermplus._btm.BTM.fit
  File "src\bitermplus\_btm.pyx", line 172, in bitermplus._btm.BTM._biterms_to_array
ValueError: Buffer dtype mismatch, expected 'long' but got 'long long'
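
The message suggests that the compiled extension expects a buffer of C `long` values but received 64-bit integers. On 64-bit Windows a C `long` is 32 bits wide, while on 64-bit Linux and macOS it is 64 bits, which would explain why the error shows up only on Windows. The sketch below prints the sizes on the current platform using the standard-library `ctypes` module; it does not touch bitermplus.

```python
import ctypes

# Width of the C integer types on the current platform, in bits
long_bits = ctypes.sizeof(ctypes.c_long) * 8
long_long_bits = ctypes.sizeof(ctypes.c_longlong) * 8

print(f"C long:      {long_bits} bits")
print(f"C long long: {long_long_bits} bits")

if long_bits != long_long_bits:
    print("long and long long differ here, so a 64-bit array is not a 'long' buffer")
```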

The Perplexity is inf

Under what circumstances is the perplexity inf?

ValueError: too many values to unpack (expected 3)

Hi,

I have recently found that running btm.get_words_freqs(texts, stop_words=stop_words) gives me ValueError: too many values to unpack (expected 3).

I am not sure of the reason, but I found that this method returns 4 values; the last one is CountVectorizer(stop_words=['word1', 'word2', 'word3']).
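
Since the installed version apparently returns a fourth value, unpacking into three names fails. A star target absorbs any extra return values, so the same line works whether three or four values come back. The sketch below demonstrates this with a hypothetical helper and placeholder tuples standing in for the real return value.

```python
def unpack_words_freqs(result):
    """Split a 3- or 4-element result into the first three items plus any extras."""
    X, vocabulary, vocab_dict, *extra = result
    return X, vocabulary, vocab_dict, extra


# Placeholder tuples standing in for what get_words_freqs might return
three_values = ("X", ["word1"], {"word1": 0})
four_values = ("X", ["word1"], {"word1": 0}, "vectorizer")

print(unpack_words_freqs(three_values)[3])  # []
print(unpack_words_freqs(four_values)[3])   # ['vectorizer']
```

Applied directly, this would read `X, vocabulary, vocab_dict, *extra = btm.get_words_freqs(texts, stop_words=stop_words)`.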

Visualization problem

First of all, thank you for the code. When I tried to visualize the results using the tmplot library, I didn't get the file. Later, when I found the saved report.html file and opened it, I landed on a strange web page. Can you help me with this?

ERROR: Failed building wheel for bitermplus

creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Topics' names?

Where can I find the names of the topics?

Calculation of nmi,ami,ri

I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply nmi, ami and ri so I'm wondering how to get the labels from the model. @maximtrp
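
One common way to turn a documents-vs-topics matrix into hard labels is to take the most probable topic per document. The sketch below does that and computes a plain Rand index against reference labels. It assumes `p_zd` has one row per document, as returned by `model.transform(docs_vec)` in other issues on this page; both helper names are illustrative.

```python
from itertools import combinations


def dominant_topics(p_zd):
    """Return the index of the most probable topic for every document."""
    return [max(range(len(row)), key=lambda z: row[z]) for row in p_zd]


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy D x T matrix and reference labels
p_zd = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels_pred = dominant_topics(p_zd)
labels_true = [1, 0, 1, 0]  # same grouping, different label names

print(labels_pred)                           # [0, 1, 0, 1]
print(rand_index(labels_true, labels_pred))  # 1.0
```

The predicted labels can then be passed to scikit-learn's `normalized_mutual_info_score`, `adjusted_mutual_info_score` and `rand_score` in `sklearn.metrics`, which do not depend on how the clusters are named.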

How to give each feature a weight value?

Hi,
Thanks for sharing your code.

get_top_topic_words yields unreasonable results

I fitted a Biterm topic model based on my lemmas and sklearn's CountVectorizer. My dataset is about German reviews on TVs and washing machines.

Unfortunately, the get_top_topic_words yields unreasonable results:

So I used your tmplot package to see whether I could reproduce it. It turns out that I get similar results with lambda=1 in tmp.report. Using a lower value gives more reasonable words.

Trying to apply it directly, I played around with tmp's helper functions which resulted in this code:

from tmplot._helpers import get_phi
from tmplot._helpers import calc_terms_probs_ratio
calc_terms_probs_ratio(get_phi(biterm_model),0)['Terms'].to_list()[:20]

I get the following output, which does make sense in the context of my project and is equal to the output of tmplot (where lambda < 1):

As per get_top_topic_words documentation it returns the words with highest probabilities in all selected topics. I am not sure what exactly I am missing out: Am I missing some mathematical context? Is there any possibility to extend this method to use custom lambda values?
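
A possible mathematical context, assuming the lambda in tmp.report is the relevance weight from LDAvis: relevance(w, k) = lambda * log p(w|k) + (1 - lambda) * log(p(w|k) / p(w)). With lambda = 1 this ranks words purely by their probability in the topic, which would match `get_top_topic_words`; lower values favor words that are distinctive for the topic over words that are frequent everywhere. The sketch below implements that formula in plain Python on made-up numbers; it is not tmplot's or bitermplus's code.

```python
import math


def relevance_ranking(phi_topic, p_w, vocabulary, lam=0.6, top_n=10):
    """Rank words by lam * log p(w|k) + (1 - lam) * log(p(w|k) / p(w))."""
    scored = []
    for word, p_wk, p_word in zip(vocabulary, phi_topic, p_w):
        if p_wk <= 0 or p_word <= 0:
            continue  # log is undefined for zero probabilities
        score = lam * math.log(p_wk) + (1 - lam) * math.log(p_wk / p_word)
        scored.append((score, word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_n]]


# Toy numbers: "und" is frequent in the topic but even more frequent overall
vocabulary = ["und", "fernseher", "bild"]
phi_topic = [0.5, 0.3, 0.2]  # p(w | topic)
p_w = [0.7, 0.2, 0.1]        # overall p(w)

by_probability = relevance_ranking(phi_topic, p_w, vocabulary, lam=1.0)
by_distinctiveness = relevance_ranking(phi_topic, p_w, vocabulary, lam=0.0)
print(by_probability)      # ['und', 'fernseher', 'bild']
print(by_distinctiveness)  # ['bild', 'fernseher', 'und']
```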
