Comments (11)
I have found the cause: the random number generator was broken under Windows and macOS (but not under Linux). A new version of the package will be released shortly.
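The thread does not say what exactly was wrong with the generator. As a purely illustrative sketch (not the package's actual code), the snippet below shows one way a faulty uniform generator can make a topic sampler collapse onto the first topic: inverse-CDF sampling picks index 0 whenever the uniform draw is stuck at 0. The names `sample_topic`, `topic_counts`, and the `broken` generator are hypothetical.

```python
import random


def sample_topic(weights, uniform):
    """Draw a topic index by inverse-CDF sampling from unnormalized weights."""
    threshold = uniform() * sum(weights)
    cumulative = 0.0
    for topic, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return topic
    return len(weights) - 1  # guard against floating-point round-off


def topic_counts(uniform, n_topics=8, n_draws=10_000):
    """Count how often each topic is drawn when all weights are equal."""
    counts = [0] * n_topics
    weights = [1.0] * n_topics
    for _ in range(n_draws):
        counts[sample_topic(weights, uniform)] += 1
    return counts


healthy = random.Random(12345).random  # uniform values in [0, 1)


def broken():
    """Hypothetical faulty generator that always returns 0."""
    return 0.0


healthy_counts = topic_counts(healthy)
broken_counts = topic_counts(broken)
print("healthy:", healthy_counts)  # draws spread over all topics
print("broken: ", broken_counts)   # every draw lands on topic 0
```

With a generator like `broken`, every assignment goes to topic 0 regardless of the weights, which would be consistent with a p_zd matrix that puts almost all probability mass in the first column.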
from bitermplus.
Hello! Thank you for reporting, but this is not enough information to give any feedback. Please describe your corpus (total number of documents and words), the model parameters, the fitting process (number of iterations), and the package version.
from bitermplus.
The corpus is Chinese text split with spaces.
Model parameters: T=i, W=vocab.size, M=20, alpha=50/i, beta=0.01
100%|██████████| 18374731/18374731 [00:05<00:00, 3271476.08it/s]
100%|██████████| 20/20 [01:12<00:00, 3.64s/it]
100%|██████████| 45388/45388 [00:03<00:00, 12323.28it/s]
Version: 0.5.10
Thank you for your help!
from bitermplus.
This is the output of test_btm.py when I run it on the SearchSnippets.txt corpus
100%|██████████| 641202/641202 [00:00<00:00, 2503526.78it/s]
100%|██████████| 20/20 [00:02<00:00, 7.64it/s]
100%|██████████| 12295/12295 [00:00<00:00, 158025.11it/s]
This is the resulting p_zd (documents vs. topics probability matrix):
[[9.99999967e-01 4.72269092e-09 4.72269092e-09 ... 4.72269092e-09
4.72269092e-09 4.72269092e-09]
[9.99999130e-01 1.24323721e-07 1.24323721e-07 ... 1.24323721e-07
1.24323721e-07 1.24323721e-07]
[9.99999968e-01 4.62112701e-09 4.62112701e-09 ... 4.62112701e-09
4.62112701e-09 4.62112701e-09]
...
[1.00000000e+00 2.54964359e-13 2.54964359e-13 ... 2.54964359e-13
2.54964359e-13 2.54964359e-13]
[1.00000000e+00 3.10026664e-13 3.10026664e-13 ... 3.10026664e-13
3.10026664e-13 3.10026664e-13]
[9.99999712e-01 4.11230259e-08 4.11230259e-08 ... 4.11230259e-08
4.11230259e-08 4.11230259e-08]]
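A quick way to quantify this symptom is to count how many documents put nearly all of their probability on a single topic, and which topic that is. The helper below is a minimal sketch with a hypothetical name (`collapse_report`); the sample rows copy the first three rows printed above, assuming eight topics and that the elided entries equal the visible ones.

```python
from collections import Counter


def collapse_report(p_zd, threshold=0.99):
    """Summarize documents that put more than `threshold` mass on one topic.

    Returns the fraction of such documents and a Counter mapping the
    dominant topic index to the number of documents it dominates.
    """
    dominant = Counter()
    for row in p_zd:
        top = max(range(len(row)), key=lambda k: row[k])
        if row[top] > threshold:
            dominant[top] += 1
    fraction = sum(dominant.values()) / len(p_zd)
    return fraction, dominant


# Rows rebuilt from the printed output (assumption: 8 topics, elided
# entries identical to the visible small values).
sample = [
    [9.99999967e-01] + [4.72269092e-09] * 7,
    [9.99999130e-01] + [1.24323721e-07] * 7,
    [9.99999968e-01] + [4.62112701e-09] * 7,
]

fraction, dominant = collapse_report(sample)
print(fraction)  # share of documents dominated by a single topic
print(dominant)  # which topic dominates, and how often
```

In a healthy fit, the dominant topic varies across documents; here every row is dominated by topic 0.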
from bitermplus.
Running only 20 iterations may lead to such results; that is simply not enough for the model to converge. My recent experiments show that model perplexity stabilizes at around 500 iterations.
But even with so few iterations I cannot replicate this result. Could you please post the full code you are using and also pass a seed value to the model's fit method?
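The reason for asking about a seed is reproducibility: with a fixed seed, a pseudo-random sampler makes the same draws on every run, so two people can compare results directly. The sketch below uses Python's standard `random` module as a stand-in, not the package's own sampler; `toy_assignments` and its parameters are hypothetical.

```python
import random


def toy_assignments(seed, n_biterms=10, n_topics=8):
    """Assign each toy biterm a random topic using a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_topics) for _ in range(n_biterms)]


run_a = toy_assignments(seed=12345)
run_b = toy_assignments(seed=12345)

# Same seed, same sequence of draws, hence identical assignments.
print(run_a == run_b)  # True
```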
from bitermplus.
# Imports inferred from the names used below (not shown in the original post)
import logging
import pickle as pkl
import unittest
from gzip import open as gzip_open

import numpy as np
import bitermplus as btm

LOGGER = logging.getLogger(__name__)


class TestBTM(unittest.TestCase):

    # Plotting tests
    def test_btm_class(self):
        # Load the corpus, one document per line
        with gzip_open('../dataset/SearchSnippets.txt.gz', 'rb') as file:
            texts = file.readlines()

        # Build the term frequency matrix, vectorized documents, and biterms
        X, vocab = btm.get_words_freqs(texts)
        docs_vec = btm.get_vectorized_docs(X)
        biterms = btm.get_biterms(X)

        LOGGER.info('Modeling started')
        model = btm.BTM(X, vocab, T=8, W=vocab.size, M=20, alpha=50/8, beta=0.01)
        # t1 = time.time()
        model.fit(biterms, seed=12345, iterations=20)
        # t2 = time.time()
        # LOGGER.info(t2 - t1)
        self.assertIsInstance(model.matrix_topics_words_, np.ndarray)
        self.assertTupleEqual(model.matrix_topics_words_.shape, (8, vocab.size))
        LOGGER.info('Modeling finished')

        # Print the documents vs. topics matrix for each inference type
        LOGGER.info('Inference started')
        p_zd = model.transform(docs_vec)
        print("sum_b", p_zd)
        LOGGER.info('Inference "sum_b" finished')
        p_zd = model.transform(docs_vec, infer_type='sum_w')
        print("sum_w", p_zd)
        LOGGER.info('Inference "sum_w" finished')
        p_zd = model.transform(docs_vec, infer_type='mix')
        print("mix", p_zd)
        LOGGER.info('Inference "mix" finished')

        LOGGER.info('Perplexity started')
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        self.assertIsInstance(perplexity, float)
        self.assertNotEqual(perplexity, 0.)
        LOGGER.info('Perplexity finished')

        LOGGER.info('Coherence started')
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        self.assertIsInstance(coherence, np.ndarray)
        self.assertGreater(coherence.shape[0], 0)
        LOGGER.info('Coherence finished')

        # Round-trip the fitted model through pickle
        LOGGER.info('Model saving/loading started')
        with open('model.pickle', 'wb') as file:
            self.assertIsNone(pkl.dump(model, file))
        with open('model.pickle', 'rb') as file:
            self.assertIsInstance(pkl.load(file), btm._btm.BTM)
        LOGGER.info('Model saving/loading finished')


if __name__ == '__main__':
    unittest.main()
I ran tests/test_btm.py directly and printed the results of p_zd for different infer_types, without making any changes to the code.
100%|██████████| 641202/641202 [00:00<00:00, 3105825.15it/s]
100%|██████████| 20/20 [00:02<00:00, 8.80it/s]
100%|██████████| 12295/12295 [00:00<00:00, 54307.48it/s]
sum_b
[[9.99989362e-01 1.51973980e-06 1.51973980e-06 ... 1.51973980e-06
1.51973980e-06 1.51973980e-06]
[9.99885135e-01 1.64092341e-05 1.64092341e-05 ... 1.64092341e-05
1.64092341e-05 1.64092341e-05]
[9.99941434e-01 8.36655753e-06 8.36655753e-06 ... 8.36655753e-06
8.36655753e-06 8.36655753e-06]
...
[9.99962688e-01 5.33031226e-06 5.33031226e-06 ... 5.33031226e-06
5.33031226e-06 5.33031226e-06]
[9.99977867e-01 3.16187601e-06 3.16187601e-06 ... 3.16187601e-06
3.16187601e-06 3.16187601e-06]
[9.99899798e-01 1.43145008e-05 1.43145008e-05 ... 1.43145008e-05
1.43145008e-05 1.43145008e-05]]
sum_w
100%|██████████| 12295/12295 [00:00<00:00, 88027.49it/s]
[[9.99971071e-01 4.13271496e-06 4.13271496e-06 ... 4.13271496e-06
4.13271496e-06 4.13271496e-06]
[9.99908825e-01 1.30250357e-05 1.30250357e-05 ... 1.30250357e-05
1.30250357e-05 1.30250357e-05]
[9.99932596e-01 9.62911447e-06 9.62911447e-06 ... 9.62911447e-06
9.62911447e-06 9.62911447e-06]
...
[9.99949473e-01 7.21817072e-06 7.21817072e-06 ... 7.21817072e-06
7.21817072e-06 7.21817072e-06]
[9.99959500e-01 5.78571297e-06 5.78571297e-06 ... 5.78571297e-06
5.78571297e-06 5.78571297e-06]
[9.99910889e-01 1.27301811e-05 1.27301811e-05 ... 1.27301811e-05
1.27301811e-05 1.27301811e-05]]
mix
100%|██████████| 12295/12295 [00:00<00:00, 181185.33it/s]
[[9.99999971e-01 4.08965415e-09 4.08965415e-09 ... 4.08965415e-09
4.08965415e-09 4.08965415e-09]
[9.99998995e-01 1.43548690e-07 1.43548690e-07 ... 1.43548690e-07
1.43548690e-07 1.43548690e-07]
[9.99999967e-01 4.68500846e-09 4.68500846e-09 ... 4.68500846e-09
4.68500846e-09 4.68500846e-09]
...
[1.00000000e+00 3.58952802e-13 3.58952802e-13 ... 3.58952802e-13
3.58952802e-13 3.58952802e-13]
[1.00000000e+00 3.92627489e-13 3.92627489e-13 ... 3.92627489e-13
3.92627489e-13 3.92627489e-13]
[9.99999686e-01 4.48075858e-08 4.48075858e-08 ... 4.48075858e-08
4.48075858e-08 4.48075858e-08]]
from bitermplus.
I still cannot replicate your results; I am getting more sensible values with the same code. Please post the output of pip list and the results obtained with 200 iterations (not 20).
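Since the full pip list output can be long, a shorter report with the operating system, the Python version, and a few package versions can be collected with the standard library alone. This is a minimal sketch; `environment_report` is a hypothetical helper, and the default package names are only a guess at what is relevant here.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("bitermplus", "numpy", "Cython")):
    """Collect OS, Python, and installed package versions for a bug report."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Including the operating system is useful for problems that appear on only some platforms.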
from bitermplus.
Package Version
---------------------------------- ------------
-atplotlib 2.2.3
-illow 5.2.0
absl-py 0.10.0
alabaster 0.7.11
anaconda-client 1.7.2
anaconda-navigator 1.9.2
anaconda-project 0.8.2
appdirs 1.4.3
Appium-Python-Client 1.0.2
asn1crypto 0.24.0
astor 0.8.1
astroid 2.0.4
astropy 3.0.4
astunparse 1.6.3
atomicwrites 1.2.1
attrs 18.2.0
Automat 0.7.0
Babel 2.6.0
backcall 0.1.0
backports.shutil-get-terminal-size 1.0.0
beautifulsoup4 4.6.3
bert-serving 0.0.1
bert-serving-client 1.10.0
bert-serving-server 1.10.0
bert-tensorflow 1.0.4
bibtexparser 1.2.0
bitarray 0.8.3
bitermplus 0.5.10
bkcharts 0.2
blaze 0.11.3
bleach 2.1.4
bokeh 0.13.0
boto 2.49.0
boto3 1.10.48
botocore 1.13.48
Bottleneck 1.2.1
cachetools 4.1.1
certifi 2020.11.8
cffi 1.11.5
chardet 3.0.4
click 6.7
cloudpickle 0.5.5
clyent 1.2.2
cmake 3.18.4.post1
colorama 0.3.9
comtypes 1.1.7
conda 4.8.3
conda-build 3.15.1
conda-package-handling 1.7.0
constantly 15.1.0
contextlib2 0.5.5
cryptography 3.0
cssselect 1.1.0
cycler 0.10.0
Cython 0.29.14
cytoolz 0.9.0.1
dask 0.19.1
datashape 0.5.4
decorator 4.3.0
defusedxml 0.5.0
distributed 1.23.1
docutils 0.14
docx 0.2.4
emoji 0.6.0
entrypoints 0.2.3
et-xmlfile 1.0.1
fake-useragent 0.1.11
Faker 4.1.1
fastcache 1.0.2
filelock 3.0.8
Flask 1.0.2
Flask-Cors 3.0.6
funcy 1.15
future 0.18.2
gast 0.3.3
gensim 3.8.3
gevent 1.3.6
glob2 0.6
google-api-core 1.23.0
google-auth 1.23.0
google-auth-oauthlib 0.4.1
google-cloud-language 2.0.0
google-pasta 0.2.0
googleapis-common-protos 1.52.0
GPUtil 1.4.0
greenlet 0.4.15
grpcio 1.33.2
h5py 2.10.0
heapdict 1.0.0
html5lib 1.0.1
hyperlink 18.0.0
idna 2.10
imageio 2.4.1
imagesize 1.1.0
importlib-metadata 1.7.0
incremental 17.5.0
ipykernel 4.10.0
ipython 6.5.0
ipython-genutils 0.2.0
ipywidgets 7.4.1
isort 4.3.4
itsdangerous 0.24
jdcal 1.4
jedi 0.12.1
jieba 0.42.1
Jinja2 2.10
jmespath 0.10.0
joblib 0.16.0
jsonschema 2.6.0
jupyter 1.0.0
jupyter-client 5.2.3
jupyter-console 5.2.0
jupyter-core 4.4.0
jupyterlab 0.34.9
jupyterlab-launcher 0.13.1
Keras 2.2.4
Keras-Applications 1.0.8
keras-bert 0.82.0
keras-embed-sim 0.8.0
keras-layer-normalization 0.14.0
keras-multi-head 0.27.0
keras-pos-embd 0.11.0
keras-position-wise-feed-forward 0.6.0
Keras-Preprocessing 1.1.2
keras-self-attention 0.46.0
keras-transformer 0.38.0
keyring 13.2.1
kiwisolver 1.0.1
lazy-object-proxy 1.3.1
libcst 0.3.13
llvmlite 0.24.0
locket 0.2.0
lxml 4.2.5
Markdown 3.2.2
MarkupSafe 1.0
matplotlib 3.3.2
mccabe 0.6.1
menuinst 1.4.14
mglearn 0.1.9
mistune 0.8.3
mkl-fft 1.0.4
mkl-random 1.0.1
mmdnn 0.1.3
mock 4.0.2
more-itertools 4.3.0
mouse 0.7.1
move 0.1.3
mpmath 1.0.0
msgpack 0.5.6
MulticoreTSNE 0.1
multipledispatch 0.6.0
mypy-extensions 0.4.3
mysqlclient 2.0.1
navigator-updater 0.2.1
nbconvert 5.4.0
nbformat 4.4.0
networkx 2.1
nltk 3.5
nose 1.3.7
notebook 5.6.0
numba 0.39.0
numexpr 2.6.8
numpy 1.18.5
numpydoc 0.8.0
oauthlib 3.1.0
odo 0.5.1
olefile 0.46
openpyxl 2.5.6
opt-einsum 3.3.0
packaging 17.1
pandas 0.23.4
pandocfilters 1.4.2
parsel 1.5.2
parso 0.3.1
partd 0.3.8
path.py 11.1.0
pathlib2 2.3.2
patsy 0.5.0
pep8 1.7.1
pickleshare 0.7.4
Pillow 7.2.0
pip 20.2.2
pkginfo 1.4.2
pluggy 0.7.1
ply 3.11
prometheus-client 0.3.1
prompt-toolkit 1.0.15
proto-plus 1.11.0
protobuf 3.13.0
psutil 5.4.7
py 1.6.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycodestyle 2.4.0
pycosat 0.6.3
pycparser 2.18
pycrypto 2.6.1
pycurl 7.43.0.5
PyDispatcher 2.0.5
pyflakes 2.0.0
Pygments 2.2.0
PyHamcrest 2.0.2
pyLDAvis 2.1.2
pylint 2.1.1
pymongo 3.11.0
PyMouse 1.0
PyMySQL 0.10.0
pyodbc 4.0.24
pyOpenSSL 19.1.0
pyparsing 2.2.0
PySocks 1.6.8
pytest 3.8.0
pytest-arraydiff 0.2
pytest-astropy 0.4.0
pytest-doctestplus 0.1.3
pytest-openfiles 0.3.0
pytest-remotedata 0.3.0
pytest-runner 5.2
python-dateutil 2.7.3
python-docx 0.8.10
pytorch-pretrained-bert 0.6.2
pytz 2020.4
PyWavelets 1.0.0
pywin32 223
pywinpty 0.5.4
PyYAML 5.3.1
pyzmq 17.1.2
qt5reactor 0.6.3
QtAwesome 0.4.4
qtconsole 4.4.1
QtPy 1.5.0
queuelib 1.5.0
redis 3.5.3
regex 2020.7.14
requests 2.25.0
requests-oauthlib 1.3.0
rope 0.11.0
rsa 4.6
ruamel-yaml 0.15.46
s3transfer 0.2.1
scikit-image 0.14.0
scikit-learn 0.19.2
scipy 1.4.1
Scrapy 1.6.0
seaborn 0.9.0
selenium 3.141.0
Send2Trash 1.5.0
service-identity 17.0.0
setuptools 50.3.2
simplegeneric 0.8.1
singledispatch 3.4.0.3
six 1.15.0
smart-open 2.1.1
snowballstemmer 1.2.1
snownlp 0.12.3
sortedcollections 1.0.1
sortedcontainers 2.0.5
Sphinx 1.7.9
sphinxcontrib-websupport 1.1.0
spyder 3.3.1
spyder-kernels 0.2.6
SQLAlchemy 1.2.11
statsmodels 0.9.0
stop-words 2018.7.23
sympy 1.1.1
tables 3.4.4
tblib 1.3.2
tensorboard 1.14.0
tensorboard-plugin-wit 1.7.0
tensorflow-estimator 1.14.0
tensorflow-gpu 1.14.0
termcolor 1.1.0
terminado 0.8.1
testpath 0.3.1
text-unidecode 1.3
toolz 0.9.0
torch 1.6.0+cu101
torchvision 0.7.0+cu101
tornado 5.1
tqdm 4.26.0
traitlets 4.3.2
Twisted 18.7.0
typing-extensions 3.7.4.3
typing-inspect 0.6.0
unicodecsv 0.14.1
urllib3 1.25.10
w3lib 1.21.0
wcwidth 0.1.7
webencodings 0.5.1
Werkzeug 0.14.1
wheel 0.31.1
widgetsnbextension 3.4.1
win-inet-pton 1.0.1
win-unicode-console 0.5
wincertstore 0.2
wrapt 1.12.1
xlrd 1.1.0
XlsxWriter 1.1.0
xlwings 0.11.8
xlwt 1.3.0
zict 0.1.3
zipp 3.1.0
zope.interface 4.5.0
Running with iterations=500 gives the same result.
from bitermplus.
I have managed to reproduce this bug under macOS and Windows, but the model is fitted correctly under Linux. I will try to figure out the cause.
from bitermplus.
My OS is Win10.
I created a new virtual environment and updated the packages, but it still doesn't work.
from bitermplus.
Thank you again for drawing attention to this problem! It is now fixed, and the new release is available on PyPI.
from bitermplus.