maximtrp / bitermplus
Biterm Topic Model (BTM): modeling topics in short texts
Home Page: https://bitermplus.readthedocs.io/en/stable/
License: MIT License
Hi,
Firstly, thanks for sharing your code.
Not an issue, just a question. I can see the relevant words for a topic in the tmplot report. How do I get those words in code? I need at least the three most relevant terms.
Thanks in advance.
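One way to read off the top terms is to sort each row of the topics-vs-words matrix. The sketch below uses a small made-up matrix and vocabulary in place of model.matrix_topics_words_ and the vocabulary returned by btm.get_words_freqs; the helper name top_terms is hypothetical. The library's get_top_topic_words, mentioned in another post on this page, may be the more direct route.

```python
import numpy as np

def top_terms(p_wz, vocabulary, n=3):
    """Return the n highest-probability terms for every topic.

    p_wz: array of shape (topics, words), e.g. model.matrix_topics_words_
    vocabulary: sequence of words aligned with the columns of p_wz
    """
    vocabulary = np.asarray(vocabulary)
    # Sort ascending, reverse each row, keep the first n column indices
    order = np.argsort(p_wz, axis=1)[:, ::-1][:, :n]
    return [vocabulary[row].tolist() for row in order]

# Toy stand-ins (hypothetical data, not output from a real model)
vocab = ["cat", "dog", "fish", "red", "blue"]
p_wz = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.05],
    [0.05, 0.10, 0.05, 0.30, 0.50],
])
terms = top_terms(p_wz, vocab, n=3)
print(terms)
```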
Hello there,
I am able to generate the model and visualize it. But when I try to find the closest topics and stable topics, I get an error on this line:
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)
The error is:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
This happens even though I checked the array separately and it is 2-D. I am pasting the code below. Could you please check whether I am doing anything wrong?
Thank you.
X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)
# Generating biterms
Y = X.todense()
biterms = btm.get_biterms(docs_vec, 15)
# INITIALIZING AND RUNNING MODEL
model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
model.fit(biterms, iterations=500, verbose=True)
p_zd = model.transform(docs_vec,verbose=True)
print(p_zd)
# matrix of document-topics; topics vs. documents, topics vs. words probabilities
matrix_docs_topics = model.matrix_docs_topics_ #Documents vs topics probabilities matrix.
topic_doc_matrix = model.matrix_topics_docs_ #Topics vs documents probabilities matrix.
matrix_topic_words = model.matrix_topics_words_ #Topics vs words probabilities matrix.
# Getting stable topics
print("Array Dimension = ",len(matrix_topic_words.shape))
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)
# Stable topics indices list
print(stable_topics)
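A possible cause, offered as a guess from the call shown above rather than from the library source: the leading * unpacks the single 2-D matrix into its rows, so the function would receive T separate 1-D arrays instead of one or more 2-D matrices. The NumPy-only sketch below reproduces the same kind of IndexError with a hypothetical stand-in function, takes_matrices; it does not call bitermplus.

```python
import numpy as np

def takes_matrices(*matrices):
    """Hypothetical stand-in: indexes each argument as a 2-D array."""
    return [m[:, 0] for m in matrices]

matrix = np.random.rand(8, 100)   # stands in for one topics-vs-words matrix

try:
    takes_matrices(*matrix)       # unpacks into 8 one-dimensional rows
    message = ""
except IndexError as exc:
    message = str(exc)
print(message)

# Passing whole matrices keeps every argument 2-D
columns = takes_matrices(matrix, matrix)
print(len(columns), columns[0].shape)
```

If get_closest_topics is meant to compare topic-word matrices from several fitted models, passing one matrix_topics_words_ per model (rather than star-unpacking a single matrix) would avoid the problem.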
Hi!
I got an error when running pip3 install bitermplus on macOS (Intel-based, Ventura), using Python 3.10.8 in a separate venv (not Anaconda):
Building wheels for collected packages: bitermplus
Building wheel for bitermplus (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for bitermplus (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [34 lines of output]
Error in sitecustomize; set PYTHONVERBOSE for traceback:
AssertionError:
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-12-x86_64-cpython-310
creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
running egg_info
writing src/bitermplus.egg-info/PKG-INFO
writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
writing requirements to src/bitermplus.egg-info/requires.txt
writing top-level names to src/bitermplus.egg-info/top_level.txt
reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
running build_ext
building 'bitermplus._btm' extension
creating build/temp.macosx-12-x86_64-cpython-310
creating build/temp.macosx-12-x86_64-cpython-310/src
creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/python@3.10/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
Could this error be related to #29? I've tested on a PC and it worked though.
Hello,
I was wondering if it is possible to remove stop words from the word frequency counts call - this one:
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
Thanks for your help,
Chitra
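Another post on this page calls btm.get_words_freqs(texts, stop_words=stop_words), so a stop_words keyword appears to be accepted. If that is not available in your version, a library-independent fallback is to strip stop words from the texts before counting. The sketch below is such a pre-filter; the function name and the tiny stop list are illustrative only.

```python
def remove_stop_words(texts, stop_words):
    """Drop stop words from whitespace-tokenized texts (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    return [
        " ".join(tok for tok in text.split() if tok.lower() not in stop)
        for text in texts
    ]

texts = ["The cat sat on the mat", "A dog and a bird"]
cleaned = remove_stop_words(texts, ["the", "a", "on", "and"])
print(cleaned)
```

The cleaned list can then be passed to btm.get_words_freqs and btm.get_vectorized_docs in place of the raw texts.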
Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code produces the error mentioned in the title. It seems that something goes wrong in the get_words_freqs function. I would appreciate your advice on how to fix it.
This error occurs intermittently and does not seem to be related to docs_vec[i]. The position at which the error is reported also changes, for example 0/244232 or 856/244232.
100%|██████████| 20/20 [00:23<00:00, 1.13s/it]
0%| | 0/244232 [00:00<?, ?it/s]
Traceback (most recent call last):
File "F:/code/depression_tieba/topic model/biterm22.py", line 34, in <module>
p_zd = model.transform(docs_vec)
File "src\bitermplus\_btm.pyx", line 446, in bitermplus._btm.BTM.transform
File "src\bitermplus\_btm.pyx", line 481, in bitermplus._btm.BTM.transform
File "src\bitermplus\_btm.pyx", line 331, in bitermplus._btm.BTM._infer_doc
File "src\bitermplus\_btm.pyx", line 365, in bitermplus._btm.BTM._infer_doc_sum_b
File "stringsource", line 153, in View.MemoryView.array.__cinit__
ValueError: Invalid shape in axis 0: 0.
docs_vec[0]
[ 86027 26789 50200 7758 66912 79522 40559 65192 34724 75526, 93343 50346 44309 60165 46216 102898 21657 42681]
[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]
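One hypothesis, not confirmed by the traceback: "Invalid shape in axis 0: 0" would be consistent with a zero-length array being allocated for a document that yields no biterms, for instance a document with fewer than two in-vocabulary tokens. Under that assumption, the sketch below finds such documents in a list of vectorized documents (toy data standing in for docs_vec) so they can be inspected or dropped before calling transform. The helper name is hypothetical.

```python
import numpy as np

def split_short_docs(docs_vec, min_tokens=2):
    """Return (kept_docs, kept_indices, short_indices)."""
    kept, kept_idx, short_idx = [], [], []
    for i, doc in enumerate(docs_vec):
        if len(doc) >= min_tokens:
            kept.append(doc)
            kept_idx.append(i)
        else:
            short_idx.append(i)
    return kept, kept_idx, short_idx

# Toy stand-in for the output of btm.get_vectorized_docs
docs_vec = [
    np.array([3, 7, 1]),
    np.array([], dtype=int),
    np.array([5]),
    np.array([2, 2]),
]
kept, kept_idx, short_idx = split_short_docs(docs_vec)
print(kept_idx, short_idx)
```

Keeping kept_idx makes it possible to map the rows of the transform output back to the original documents.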
Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
When I try to install bitermplus with pip install bitermplus, I get an error message like this:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, when I train the model on labeled samples, I get unexpected results. First, the labeled samples contain 5 classes, but the trained model has a very high perplexity when the number of topics is 5. Second, when I vary the number of topics from 1 to 20, the perplexity decreases as the number of topics increases. My code is as follows:
import pandas as pd
import bitermplus as btm
# segmentWord is presumably the poster's own helper module (not shown)
df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
print(df)
stop_words = segmentWord.stopwordslist()
perplexitys = []
coherences = []
for T in range(1, 21):
    print(T)
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Generating biterms
    biterms = btm.get_biterms(docs_vec)
    # Initializing and running the model
    model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
    model.fit(biterms, iterations=2000)
    p_zd = model.transform(docs_vec)
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
    coherence = model.coherence_
    perplexitys.append(perplexity)
    coherences.append(coherence)
Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, I found that the vocabulary passed to BTM filters out single-character words. For English, this is a good way to drop meaningless tokens, but in other languages, such as Chinese, some single-character words carry meaning that matters for semantic understanding. Do you provide an option to turn this filtering off? I would appreciate your advice.
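For context: scikit-learn's CountVectorizer uses the default token_pattern r"(?u)\b\w\w+\b", which only matches tokens of two or more characters. If get_words_freqs forwards keyword arguments to CountVectorizer (an assumption here, not verified against the library source), passing token_pattern=r"(?u)\b\w+\b" would keep single-character tokens. The sketch below shows the difference between the two patterns using only the re module, on an already segmented, space-separated Chinese sentence.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # CountVectorizer default: 2+ characters
SINGLE_OK_PATTERN = r"(?u)\b\w+\b"   # also keeps 1-character tokens

text = "我 爱 自然 语言 处理"          # pre-segmented, space-separated

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(SINGLE_OK_PATTERN, text)
print(default_tokens)
print(all_tokens)
```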
From my understanding, biterm.perplexity() takes in three inputs: p_wz, the topics vs. words probabilities matrix (T x W); p_zd, the documents vs. topics probabilities matrix (D x T); and T, the number of topics. Other topic models often produce these same matrices as outputs.
May I ask if it is possible to use biterm.perplexity() to calculate the perplexity by Heinrich (2005) of other topic models?
Thank you!
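For reference, perplexity in this formulation depends only on the two probability matrices and the document-term counts, so in principle it can be computed for any model that provides them. A NumPy sketch follows; it assumes p_wz has shape (T, W), p_zd has shape (D, T), and n_dw is a dense (D, W) count matrix. It is an independent illustration of the formula, not the bitermplus implementation, whose argument conventions may differ.

```python
import numpy as np

def perplexity(p_wz, p_zd, n_dw):
    """exp(-sum_{d,w} n_dw * log p(w|d) / sum_{d,w} n_dw)."""
    p_wd = p_zd @ p_wz              # (D, W): p(w|d) = sum_z p(z|d) * p(w|z)
    mask = n_dw > 0                 # only observed words contribute
    log_lik = np.sum(n_dw[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))

# Sanity check: one topic, uniform over 5 words, so perplexity should be about 5
p_wz = np.full((1, 5), 0.2)
p_zd = np.ones((3, 1))
n_dw = np.array([
    [2, 1, 0, 3, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 4, 0, 2],
])
value = perplexity(p_wz, p_zd, n_dw)
print(value)
```

The uniform case is a convenient check because the result should equal the vocabulary size regardless of the counts.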
Hi! Thank you for sharing this repository. It is easy to use, and the combination with tmplot is especially nice.
I would like to ask about the implementation of the perplexity calculation. When I used this repo, I noticed that the scale is a little strange, so I tested the implementation as follows:
import bitermplus as btm
import numpy as np
# Parameter; choose '1' or '4' to run assertion
topics_num = 1
# Generate test data
word_list = [
    ["white", "black", "red", "green", "blue"],
    ["dog", "cat", "fish", "bird", "rabbit"],
    ["apple", "banana", "lemon", "orange", "melon"],
    ["Japan", "America", "China", "England", "France"],
]
documents = [
    " ".join(np.random.choice(word_list[topic], 100))
    for topic in range(len(word_list)) for i in range(100)
]
# Vectorizing documents, obtaining full vocabulary and biterms
X, vocabulary, _ = btm.get_words_freqs(documents)
docs_vec = btm.get_vectorized_docs(documents, vocabulary)
biterms = btm.get_biterms(docs_vec)
print("Modeling started")
model = btm.BTM(
    X,
    vocabulary,
    seed=12321,
    T=topics_num,
    M=20,
    alpha=50 / topics_num,
    beta=0.01,
)
model.fit(biterms, iterations=100)
print("Evaluating perplexity")
# Run this to assign self.p_zd
model.transform(docs_vec)
perplexity = model.perplexity_
if topics_num == 1:
    assert perplexity > 10, "Perplexity should be about 20 when the number of topic is one."
elif topics_num == 4:
    assert perplexity < 10, "Perplexity should be about five when the number of topic is four."
assert perplexity > 0, "Perplexity must be greater than zero."
assert perplexity > 5, "Perplexity should be greater than five as each topic has five words."
This test fails: I got 7.620787225693601 when topics_num is one, and 3.343172868550849 when topics_num is four. I think both values are too small.
I would be glad if you could share any insight into this behavior!
I have two questions regarding this model. First, I noticed that perplexity is implemented as an evaluation metric. However, perplexity is traditionally computed on a held-out dataset. Does that mean that, when using this model, we should hold out a certain proportion of the data and compute the perplexity on samples that were not used for training?
My second question: I tried to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that, and which part is implemented differently?
import pickle as pkl
import pandas as pd
import bitermplus as btm
with open('model{}.pickle'.format(best_topic), 'rb') as f:
    best_model = pkl.load(f)
user_vec = btm.get_vectorized_docs(user_texts, vocabulary)  # len(user_texts) == 400
user_td = best_model.transform(user_vec)  # however, len(user_td) == 245081 (length of the training corpus)
user_topic = pd.DataFrame()
user_topic['portrait'] = user['portrait']  # len(user_topic['portrait']) == 400
user_topic['topic'] = list(user_td)  # ERROR: 245081 != 400
Hi,
I can't install the library on macOS with an Intel chip.
I'm in JupyterLab, with the latest version of Python 3 and the latest pip/wheel/setuptools. I have libomp installed using brew. I have the Xcode command line tools (xcode-select) installed.
Error is as follows:
Building wheels for collected packages: bitermplus
Building wheel for bitermplus (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for bitermplus (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [21 lines of output]
/private/var/folders/p0/b89vzkwj6s59mbscpyll6dzm0000gn/T/pip-build-env-pxw75s_w/overlay/lib/python3.11/site-packages/setuptools/config/pyprojecttoml.py:66: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
config = read_configuration(filepath, True, ignore_option_errors, dist)
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-10.9-universal2-cpython-311
creating build/lib.macosx-10.9-universal2-cpython-311/bitermplus
copying src/bitermplus/__init__.py -> build/lib.macosx-10.9-universal2-cpython-311/bitermplus
copying src/bitermplus/_util.py -> build/lib.macosx-10.9-universal2-cpython-311/bitermplus
running build_ext
building 'bitermplus._btm' extension
creating build/temp.macosx-10.9-universal2-cpython-311
creating build/temp.macosx-10.9-universal2-cpython-311/src
creating build/temp.macosx-10.9-universal2-cpython-311/src/bitermplus
clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-311/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:776:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
Any help would be appreciated!
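A possible explanation, stated as an assumption: Homebrew's libomp may be installed keg-only, in which case clang does not search its include directory unless told to. The Python sketch below looks for omp.h under the two usual Homebrew prefixes (/usr/local/opt/libomp on Intel, /opt/homebrew/opt/libomp on Apple Silicon) and prints CPPFLAGS and LDFLAGS values that would point the compiler at it. Whether the bitermplus build honors these environment variables is not confirmed here.

```python
import os

HOMEBREW_PREFIXES = ["/usr/local/opt/libomp", "/opt/homebrew/opt/libomp"]

def find_omp_prefix(prefixes=HOMEBREW_PREFIXES):
    """Return the first prefix containing include/omp.h, or None."""
    for prefix in prefixes:
        if os.path.isfile(os.path.join(prefix, "include", "omp.h")):
            return prefix
    return None

def build_flags(prefix):
    """Compiler and linker flags pointing at a libomp prefix."""
    return {
        "CPPFLAGS": f"-I{prefix}/include",
        "LDFLAGS": f"-L{prefix}/lib",
    }

prefix = find_omp_prefix()
if prefix is None:
    print("omp.h not found under the usual Homebrew prefixes")
else:
    for name, value in build_flags(prefix).items():
        print(f'export {name}="{value}"')
```

If a prefix is found, exporting the printed variables in the shell before running pip install bitermplus is one thing to try.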
I was wondering whether there is any way to print the topics generated by the BTM model, just as I can with Gensim. In addition, I am getting all negative coherence values, in the range of -500 to -600. I am not sure if I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output.
The following image shows what the variable adobe holds. Again, I am not sure whether it should be in this form or whether each row needs to be a list.
Can we add the following to the docs
sudo apt-get install python3.x-dev
as an installation step to run before pip on Linux, when installing into a virtual environment that uses a different Python executable?
Otherwise I get this wheel build error:
error: subprocess-exited-with-error
× Building wheel for bitermplus (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/bitermplus
copying src/bitermplus/__init__.py -> build/lib.linux-x86_64-cpython-38/bitermplus
copying src/bitermplus/_util.py -> build/lib.linux-x86_64-cpython-38/bitermplus
running build_ext
building 'bitermplus._btm' extension
creating build/temp.linux-x86_64-cpython-38
creating build/temp.linux-x86_64-cpython-38/src
creating build/temp.linux-x86_64-cpython-38/src/bitermplus
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/home/******/.virtualenvs/tfcert/include -I/usr/include/python3.8 -c src/bitermplus/_btm.c -o build/temp.linux-x86_64-cpython-38/src/bitermplus/_btm.o -fopenmp
src/bitermplus/_btm.c:35:10: fatal error: Python.h: No such file or directory
35 | #include "Python.h"
| ^~~~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
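The missing Python.h usually means the development headers for the interpreter in use are not installed. A quick check from inside the same virtual environment, using only the standard library, is sketched below; the function names are illustrative.

```python
import os
import sysconfig

def python_header_path():
    """Path where the running interpreter expects Python.h to live."""
    return os.path.join(sysconfig.get_paths()["include"], "Python.h")

def has_python_header():
    """True if Python.h exists for the running interpreter."""
    return os.path.isfile(python_header_path())

path = python_header_path()
print(path, "found" if has_python_header() else "missing")
```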
After running the model, how can I get the topic proportions of each document so that I can assign them back to the dataframe? Or, if that isn't possible, how can I get the documents associated with a topic, as shown in the tmp.report(..) plot?
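Other posts on this page obtain the documents-vs-topics matrix with p_zd = model.transform(docs_vec). Assuming its rows follow the order of the input documents, the dominant topic of each document is the row-wise argmax. The sketch below uses a toy matrix in place of a real p_zd, and the helper name docs_for_topic is hypothetical.

```python
import numpy as np

# Toy stand-in for p_zd = model.transform(docs_vec): shape (documents, topics)
p_zd = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

top_topic = p_zd.argmax(axis=1)   # most probable topic per document
top_prob = p_zd.max(axis=1)       # probability of that topic

def docs_for_topic(p_zd, topic):
    """Indices of documents whose dominant topic is `topic`."""
    return np.flatnonzero(p_zd.argmax(axis=1) == topic).tolist()

print(top_topic.tolist(), docs_for_topic(p_zd, 1))
```

With a pandas DataFrame df that holds one row per document in the same order, df['topic'] = top_topic and df['topic_prob'] = top_prob would attach the results. The same top_topic array can serve as predicted labels when comparing against ground-truth classes.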
OS: Windows
Error
Traceback (most recent call last):
File "D:\ProgramData\Anaconda\lib\unittest\case.py", line 59, in testPartExecutor
yield
File "D:\ProgramData\Anaconda\lib\unittest\case.py", line 615, in run
testMethod()
File "D:\PYworkspace\bitermplus-main\tests\test_btm.py", line 32, in test_btm_class
model.fit(biterms, iterations=20)
File "src\bitermplus\_btm.pyx", line 226, in bitermplus._btm.BTM.fit
File "src\bitermplus\_btm.pyx", line 236, in bitermplus._btm.BTM.fit
File "src\bitermplus\_btm.pyx", line 172, in bitermplus._btm.BTM._biterms_to_array
ValueError: Buffer dtype mismatch, expected 'long' but got 'long long'
I wonder under what circumstances the perplexity is inf.
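One way this can happen, assuming perplexity is computed as the exponential of the negative average log-likelihood: if the model assigns probability zero to a word that actually occurs in a document, the log-likelihood is minus infinity and the perplexity is inf. A minimal NumPy illustration with made-up numbers:

```python
import numpy as np

counts = np.array([2.0, 1.0, 1.0])       # observed word counts in one document
probs_ok = np.array([0.5, 0.3, 0.2])     # every observed word has p > 0
probs_zero = np.array([0.5, 0.5, 0.0])   # the third word has p = 0

def doc_perplexity(counts, probs):
    # log(0) is -inf; silence the NumPy warning so the result is visible
    with np.errstate(divide="ignore"):
        log_lik = np.sum(counts * np.log(probs))
    return float(np.exp(-log_lik / counts.sum()))

finite_value = doc_perplexity(counts, probs_ok)
infinite_value = doc_perplexity(counts, probs_zero)
print(finite_value, infinite_value)
```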
Hi,
I recently found that running btm.get_words_freqs(texts, stop_words=stop_words) gives me ValueError: too many values to unpack (expected 3).
I am not sure of the reason, but I found that this method returns 4 values, the last one being CountVectorizer(stop_words=['word1', 'word2', 'word3'])
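If the function returns an extra trailing item in the installed version (a CountVectorizer, according to the post above), star-unpacking tolerates both the 3-item and the 4-item shape. The sketch uses placeholder tuples rather than calling the library, and the wrapper name is hypothetical.

```python
def unpack_words_freqs(result):
    """Accept a 3- or 4-item result; extra trailing items go into `rest`."""
    X, vocabulary, vocab_dict, *rest = result
    return X, vocabulary, vocab_dict, rest

# Placeholder tuples standing in for the return value of btm.get_words_freqs
three = ("X", "vocabulary", "vocab_dict")
four = three + ("vectorizer",)

print(unpack_words_freqs(three)[3])
print(unpack_words_freqs(four)[3])
```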
First of all, thank you for the code. When I tried to visualize the results using the tmplot library, I didn't get the file. Later, when I found the saved report.html file and opened it, it showed a strange web page. Can you help me with this?
creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
#include <omp.h>
^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
I wonder where the names of the topics are stored?
I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to compute NMI, AMI and RI, so I'm wondering how to get the labels from the model. @maximtrp
Hi,
Thanks for sharing your code.
I fitted a Biterm topic model based on my lemmas and sklearn's CountVectorizer. My dataset is about German reviews on TVs and washing machines.
Unfortunately, get_top_topic_words yields unreasonable results:
So I used your tmplot package to see whether I could reproduce them. It turns out that I get similar results with lambda=1 inside tmp.report. Using a lower value results in more reasonable words.
Trying to apply it directly, I played around with tmp's helper functions which resulted in this code:
from tmplot._helpers import get_phi
from tmplot._helpers import calc_terms_probs_ratio
calc_terms_probs_ratio(get_phi(biterm_model),0)['Terms'].to_list()[:20]
I get the following output, which does make sense in the context of my project and is equal to the output of tmplot (where lambda < 1):
According to its documentation, get_top_topic_words returns the words with the highest probabilities in all selected topics. I am not sure what exactly I am missing: is there some mathematical context I am overlooking? Is there any way to extend this method to use custom lambda values?
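For background, the lambda slider in LDAvis-style reports ranks terms by relevance, defined by Sievert and Shirley (2014) as relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). With lambda = 1 this reduces to ranking by p(w|t), which would be consistent with the match to get_top_topic_words. The sketch below implements that formula in NumPy on toy data; whether tmplot's calc_terms_probs_ratio uses exactly this definition is an assumption, and p_w is taken here as a supplied marginal word distribution.

```python
import numpy as np

def relevance_ranking(p_wz, p_w, vocabulary, topic, lam=0.6, n=3):
    """Top-n terms of one topic by LDAvis-style relevance."""
    p_t = p_wz[topic]
    relevance = lam * np.log(p_t) + (1.0 - lam) * np.log(p_t / p_w)
    order = np.argsort(relevance)[::-1][:n]
    return [vocabulary[i] for i in order]

# Toy data (hypothetical): two frequent generic words, plus topic-specific ones
vocab = ["gut", "sehr", "bild", "waschen", "ton"]
p_wz = np.array([
    [0.40, 0.30, 0.20, 0.02, 0.08],   # topic 0
    [0.40, 0.30, 0.02, 0.25, 0.03],   # topic 1
])
p_w = p_wz.mean(axis=0)               # marginal, assuming equally likely topics

by_probability = relevance_ranking(p_wz, p_w, vocab, topic=0, lam=1.0)
by_relevance = relevance_ranking(p_wz, p_w, vocab, topic=0, lam=0.2)
print(by_probability, by_relevance)
```

In this toy case, lowering lambda pushes words that are frequent in every topic down the list and moves the topic-specific word to the top.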