Comments (11)

maximtrp commented on September 27, 2024

I have found the cause: the random number generator was broken under Windows and macOS (but not under Linux). A new version of the package will be released shortly.
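
As a purely illustrative sketch (not a description of the actual bitermplus bug, whose details are not given in this thread), here is one way a faulty uniform generator could push every draw onto the first topic. The names `sample_topic` and `topic_counts` are hypothetical, and the "broken" generator is simply assumed to be stuck at 0.

```python
import random


def sample_topic(weights, uniform):
    """Draw a topic index from unnormalised weights by inverse-CDF sampling."""
    threshold = uniform() * sum(weights)
    cumulative = 0.0
    for topic, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return topic
    return len(weights) - 1  # guard against floating-point round-off


def topic_counts(uniform, n_topics=8, draws=10_000):
    """Count how often each topic is drawn when all weights are equal."""
    weights = [1.0] * n_topics
    counts = [0] * n_topics
    for _ in range(draws):
        counts[sample_topic(weights, uniform)] += 1
    return counts


# A working generator spreads the draws over all topics.
healthy = topic_counts(random.Random(12345).random)

# Hypothetical faulty generator that always returns 0.0:
# every draw lands on topic 0.
broken = topic_counts(lambda: 0.0)

print("healthy:", healthy)
print("broken: ", broken)
```

With a generator like the second one, a sampler of this kind would assign everything to topic 0 regardless of the number of iterations, which is at least consistent with the matrices reported in this thread.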

from bitermplus.

maximtrp commented on September 27, 2024

Hello! Thank you for reporting, but this is not enough information to give any feedback. Please provide details about your corpus (total number of documents and words), the model parameters, the fitting process (number of iterations), and the package version.

JennieGerhardt commented on September 27, 2024

The corpus is Chinese text split with spaces.
Model parameters: T=i, W=vocab.size, M=20, alpha=50/i, beta=0.01

100%|██████████| 18374731/18374731 [00:05<00:00, 3271476.08it/s]
100%|██████████| 20/20 [01:12<00:00, 3.64s/it]
100%|██████████| 45388/45388 [00:03<00:00, 12323.28it/s]

Version: 0.5.10

Thank you for your help!

JennieGerhardt commented on September 27, 2024

This is the output of test_btm.py when I run it on the SearchSnippets.txt corpus:
100%|██████████| 641202/641202 [00:00<00:00, 2503526.78it/s]
100%|██████████| 20/20 [00:02<00:00, 7.64it/s]
100%|██████████| 12295/12295 [00:00<00:00, 158025.11it/s]

This is the resulting p_zd (documents vs. topics probability matrix):

[[9.99999967e-01 4.72269092e-09 4.72269092e-09 ... 4.72269092e-09
  4.72269092e-09 4.72269092e-09]
 [9.99999130e-01 1.24323721e-07 1.24323721e-07 ... 1.24323721e-07
  1.24323721e-07 1.24323721e-07]
 [9.99999968e-01 4.62112701e-09 4.62112701e-09 ... 4.62112701e-09
  4.62112701e-09 4.62112701e-09]
 ...
 [1.00000000e+00 2.54964359e-13 2.54964359e-13 ... 2.54964359e-13
  2.54964359e-13 2.54964359e-13]
 [1.00000000e+00 3.10026664e-13 3.10026664e-13 ... 3.10026664e-13
  3.10026664e-13 3.10026664e-13]
 [9.99999712e-01 4.11230259e-08 4.11230259e-08 ... 4.11230259e-08
  4.11230259e-08 4.11230259e-08]]
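
A matrix like this can be summarised with a small helper rather than inspected by eye. The sketch below is only an illustration: `collapse_report` and its 0.99 threshold are hypothetical, not part of bitermplus, and it accepts any iterable of rows (a NumPy array such as p_zd should work as well as nested lists).

```python
from collections import Counter


def collapse_report(p_zd, threshold=0.99):
    """Summarise how concentrated a documents-vs-topics matrix is.

    p_zd: iterable of rows, one row of topic probabilities per document.
    threshold: a document counts as "saturated" when its largest
    probability is at least this value (0.99 is an arbitrary choice).
    """
    dominant = Counter()
    saturated = 0
    n_docs = 0
    for row in p_zd:
        row = list(row)
        # Index of the most probable topic for this document.
        top = max(range(len(row)), key=row.__getitem__)
        dominant[top] += 1
        saturated += row[top] >= threshold
        n_docs += 1
    if n_docs == 0:
        raise ValueError("p_zd has no rows")
    topic, count = dominant.most_common(1)[0]
    return {
        "top_topic": topic,
        "top_topic_share": count / n_docs,
        "saturated_share": saturated / n_docs,
    }


# Two rows copied from the output above, expanded to 8 topics
# (T=8 is assumed here, as in the test script).
rows = [
    [9.99999967e-01] + [4.72269092e-09] * 7,
    [9.99999130e-01] + [1.24323721e-07] * 7,
]
report = collapse_report(rows)
print(report)
```

If nearly every document is saturated on the same topic, the model has most likely collapsed rather than merely under-converged.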

maximtrp commented on September 27, 2024

Running 20 iterations may lead to such results. This is simply not enough for the model to converge. My recent experiments show that model perplexity stabilizes somewhere around 500 iterations.
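
One simple way to judge whether perplexity has stabilised is to record it at several iteration counts and look for a run of small relative changes. The helper below is a minimal sketch under that assumption; `stabilised_at`, `rel_tol` and `patience` are hypothetical names, and the numbers in the example are made up for illustration, not measurements.

```python
def stabilised_at(values, rel_tol=0.01, patience=3):
    """Return the index where `values` first stays stable, or None.

    "Stable" means `patience` consecutive steps whose relative change is
    below `rel_tol`. Values are assumed to be positive (as perplexity is).
    """
    streak = 0
    for i in range(1, len(values)):
        change = abs(values[i] - values[i - 1]) / abs(values[i - 1])
        streak = streak + 1 if change < rel_tol else 0
        if streak == patience:
            # First index of the stable run.
            return i - patience
    return None


# Synthetic perplexity values recorded at successive checkpoints
# (illustrative only, not taken from any real run).
perplexities = [1000.0, 800.0, 650.0, 600.0, 598.0, 597.0, 596.5]
idx = stabilised_at(perplexities)
print(idx)
```

A result of None means the series never settled within the recorded checkpoints, so more iterations would be needed before drawing conclusions.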

But even with such a small number of iterations, I cannot replicate this result. Could you please post the full code you are using and also pass a seed value to the model's fit method?

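JennieGerhardt commented on September 27, 2024

# Imports and logger setup were not part of the pasted snippet; the lines
# below are assumed from the names the test uses (gzip_open, pkl, np, btm, LOGGER).
import logging
import pickle as pkl
import unittest
from gzip import open as gzip_open

import numpy as np
import bitermplus as btm

LOGGER = logging.getLogger(__name__)


class TestBTM(unittest.TestCase):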

    # Plotting tests
    def test_btm_class(self):
        with gzip_open('../dataset/SearchSnippets.txt.gz', 'rb') as file:
            texts = file.readlines()

        X, vocab = btm.get_words_freqs(texts)
        docs_vec = btm.get_vectorized_docs(X)
        biterms = btm.get_biterms(X)

        LOGGER.info('Modeling started')
        model = btm.BTM(X, vocab, T=8, W=vocab.size, M=20, alpha=50/8, beta=0.01)
        # t1 = time.time()
        model.fit(biterms, seed=12345, iterations=20)
        # t2 = time.time()
        # LOGGER.info(t2 - t1)
        self.assertIsInstance(model.matrix_topics_words_, np.ndarray)
        self.assertTupleEqual(model.matrix_topics_words_.shape, (8, vocab.size))
        LOGGER.info('Modeling finished')

        LOGGER.info('Inference started')
        p_zd = model.transform(docs_vec)
        print("sum_b",p_zd)
        LOGGER.info('Inference "sum_b" finished')
        p_zd = model.transform(docs_vec, infer_type='sum_w')
        print("sum_w",p_zd)
        LOGGER.info('Inference "sum_w" finished')
        p_zd = model.transform(docs_vec, infer_type='mix')
        print("mix",p_zd)
        LOGGER.info('Inference "mix" finished')

        LOGGER.info('Perplexity started')
        perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
        self.assertIsInstance(perplexity, float)
        self.assertNotEqual(perplexity, 0.)
        LOGGER.info('Perplexity finished')

        LOGGER.info('Coherence started')
        coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
        self.assertIsInstance(coherence, np.ndarray)
        self.assertGreater(coherence.shape[0], 0)
        LOGGER.info('Coherence finished')

        LOGGER.info('Model saving/loading started')
        with open('model.pickle', 'wb') as file:
            self.assertIsNone(pkl.dump(model, file))
        with open('model.pickle', 'rb') as file:
            self.assertIsInstance(pkl.load(file), btm._btm.BTM)
        LOGGER.info('Model saving/loading finished')

if __name__ == '__main__':
    unittest.main()

I ran tests/test_btm.py directly and printed the results of p_zd for different infer_types, without making any changes to the code.
100%|██████████| 641202/641202 [00:00<00:00, 3105825.15it/s]
100%|██████████| 20/20 [00:02<00:00, 8.80it/s]
100%|██████████| 12295/12295 [00:00<00:00, 54307.48it/s]

sum_b

[[9.99989362e-01 1.51973980e-06 1.51973980e-06 ... 1.51973980e-06
  1.51973980e-06 1.51973980e-06]
 [9.99885135e-01 1.64092341e-05 1.64092341e-05 ... 1.64092341e-05
  1.64092341e-05 1.64092341e-05]
 [9.99941434e-01 8.36655753e-06 8.36655753e-06 ... 8.36655753e-06
  8.36655753e-06 8.36655753e-06]
 ...
 [9.99962688e-01 5.33031226e-06 5.33031226e-06 ... 5.33031226e-06
  5.33031226e-06 5.33031226e-06]
 [9.99977867e-01 3.16187601e-06 3.16187601e-06 ... 3.16187601e-06
  3.16187601e-06 3.16187601e-06]
 [9.99899798e-01 1.43145008e-05 1.43145008e-05 ... 1.43145008e-05
  1.43145008e-05 1.43145008e-05]]

sum_w

100%|██████████| 12295/12295 [00:00<00:00, 88027.49it/s]

[[9.99971071e-01 4.13271496e-06 4.13271496e-06 ... 4.13271496e-06
  4.13271496e-06 4.13271496e-06]
 [9.99908825e-01 1.30250357e-05 1.30250357e-05 ... 1.30250357e-05
  1.30250357e-05 1.30250357e-05]
 [9.99932596e-01 9.62911447e-06 9.62911447e-06 ... 9.62911447e-06
  9.62911447e-06 9.62911447e-06]
 ...
 [9.99949473e-01 7.21817072e-06 7.21817072e-06 ... 7.21817072e-06
  7.21817072e-06 7.21817072e-06]
 [9.99959500e-01 5.78571297e-06 5.78571297e-06 ... 5.78571297e-06
  5.78571297e-06 5.78571297e-06]
 [9.99910889e-01 1.27301811e-05 1.27301811e-05 ... 1.27301811e-05
  1.27301811e-05 1.27301811e-05]]

mix

100%|██████████| 12295/12295 [00:00<00:00, 181185.33it/s]

[[9.99999971e-01 4.08965415e-09 4.08965415e-09 ... 4.08965415e-09
  4.08965415e-09 4.08965415e-09]
 [9.99998995e-01 1.43548690e-07 1.43548690e-07 ... 1.43548690e-07
  1.43548690e-07 1.43548690e-07]
 [9.99999967e-01 4.68500846e-09 4.68500846e-09 ... 4.68500846e-09
  4.68500846e-09 4.68500846e-09]
 ...
 [1.00000000e+00 3.58952802e-13 3.58952802e-13 ... 3.58952802e-13
  3.58952802e-13 3.58952802e-13]
 [1.00000000e+00 3.92627489e-13 3.92627489e-13 ... 3.92627489e-13
  3.92627489e-13 3.92627489e-13]
 [9.99999686e-01 4.48075858e-08 4.48075858e-08 ... 4.48075858e-08
  4.48075858e-08 4.48075858e-08]]

maximtrp commented on September 27, 2024

I still cannot replicate your results. I am getting more sensible values with the same code. Please post the output of pip list and the results obtained with 200 iterations (not 20).

JennieGerhardt commented on September 27, 2024

Package                            Version
---------------------------------- ------------
-atplotlib                         2.2.3
-illow                             5.2.0
absl-py                            0.10.0
alabaster                          0.7.11
anaconda-client                    1.7.2
anaconda-navigator                 1.9.2
anaconda-project                   0.8.2
appdirs                            1.4.3
Appium-Python-Client               1.0.2
asn1crypto                         0.24.0
astor                              0.8.1
astroid                            2.0.4
astropy                            3.0.4
astunparse                         1.6.3
atomicwrites                       1.2.1
attrs                              18.2.0
Automat                            0.7.0
Babel                              2.6.0
backcall                           0.1.0
backports.shutil-get-terminal-size 1.0.0
beautifulsoup4                     4.6.3
bert-serving                       0.0.1
bert-serving-client                1.10.0
bert-serving-server                1.10.0
bert-tensorflow                    1.0.4
bibtexparser                       1.2.0
bitarray                           0.8.3
bitermplus                         0.5.10
bkcharts                           0.2
blaze                              0.11.3
bleach                             2.1.4
bokeh                              0.13.0
boto                               2.49.0
boto3                              1.10.48
botocore                           1.13.48
Bottleneck                         1.2.1
cachetools                         4.1.1
certifi                            2020.11.8
cffi                               1.11.5
chardet                            3.0.4
click                              6.7
cloudpickle                        0.5.5
clyent                             1.2.2
cmake                              3.18.4.post1
colorama                           0.3.9
comtypes                           1.1.7
conda                              4.8.3
conda-build                        3.15.1
conda-package-handling             1.7.0
constantly                         15.1.0
contextlib2                        0.5.5
cryptography                       3.0
cssselect                          1.1.0
cycler                             0.10.0
Cython                             0.29.14
cytoolz                            0.9.0.1
dask                               0.19.1
datashape                          0.5.4
decorator                          4.3.0
defusedxml                         0.5.0
distributed                        1.23.1
docutils                           0.14
docx                               0.2.4
emoji                              0.6.0
entrypoints                        0.2.3
et-xmlfile                         1.0.1
fake-useragent                     0.1.11
Faker                              4.1.1
fastcache                          1.0.2
filelock                           3.0.8
Flask                              1.0.2
Flask-Cors                         3.0.6
funcy                              1.15
future                             0.18.2
gast                               0.3.3
gensim                             3.8.3
gevent                             1.3.6
glob2                              0.6
google-api-core                    1.23.0
google-auth                        1.23.0
google-auth-oauthlib               0.4.1
google-cloud-language              2.0.0
google-pasta                       0.2.0
googleapis-common-protos           1.52.0
GPUtil                             1.4.0
greenlet                           0.4.15
grpcio                             1.33.2
h5py                               2.10.0
heapdict                           1.0.0
html5lib                           1.0.1
hyperlink                          18.0.0
idna                               2.10
imageio                            2.4.1
imagesize                          1.1.0
importlib-metadata                 1.7.0
incremental                        17.5.0
ipykernel                          4.10.0
ipython                            6.5.0
ipython-genutils                   0.2.0
ipywidgets                         7.4.1
isort                              4.3.4
itsdangerous                       0.24
jdcal                              1.4
jedi                               0.12.1
jieba                              0.42.1
Jinja2                             2.10
jmespath                           0.10.0
joblib                             0.16.0
jsonschema                         2.6.0
jupyter                            1.0.0
jupyter-client                     5.2.3
jupyter-console                    5.2.0
jupyter-core                       4.4.0
jupyterlab                         0.34.9
jupyterlab-launcher                0.13.1
Keras                              2.2.4
Keras-Applications                 1.0.8
keras-bert                         0.82.0
keras-embed-sim                    0.8.0
keras-layer-normalization          0.14.0
keras-multi-head                   0.27.0
keras-pos-embd                     0.11.0
keras-position-wise-feed-forward   0.6.0
Keras-Preprocessing                1.1.2
keras-self-attention               0.46.0
keras-transformer                  0.38.0
keyring                            13.2.1
kiwisolver                         1.0.1
lazy-object-proxy                  1.3.1
libcst                             0.3.13
llvmlite                           0.24.0
locket                             0.2.0
lxml                               4.2.5
Markdown                           3.2.2
MarkupSafe                         1.0
matplotlib                         3.3.2
mccabe                             0.6.1
menuinst                           1.4.14
mglearn                            0.1.9
mistune                            0.8.3
mkl-fft                            1.0.4
mkl-random                         1.0.1
mmdnn                              0.1.3
mock                               4.0.2
more-itertools                     4.3.0
mouse                              0.7.1
move                               0.1.3
mpmath                             1.0.0
msgpack                            0.5.6
MulticoreTSNE                      0.1
multipledispatch                   0.6.0
mypy-extensions                    0.4.3
mysqlclient                        2.0.1
navigator-updater                  0.2.1
nbconvert                          5.4.0
nbformat                           4.4.0
networkx                           2.1
nltk                               3.5
nose                               1.3.7
notebook                           5.6.0
numba                              0.39.0
numexpr                            2.6.8
numpy                              1.18.5
numpydoc                           0.8.0
oauthlib                           3.1.0
odo                                0.5.1
olefile                            0.46
openpyxl                           2.5.6
opt-einsum                         3.3.0
packaging                          17.1
pandas                             0.23.4
pandocfilters                      1.4.2
parsel                             1.5.2
parso                              0.3.1
partd                              0.3.8
path.py                            11.1.0
pathlib2                           2.3.2
patsy                              0.5.0
pep8                               1.7.1
pickleshare                        0.7.4
Pillow                             7.2.0
pip                                20.2.2
pkginfo                            1.4.2
pluggy                             0.7.1
ply                                3.11
prometheus-client                  0.3.1
prompt-toolkit                     1.0.15
proto-plus                         1.11.0
protobuf                           3.13.0
psutil                             5.4.7
py                                 1.6.0
pyasn1                             0.4.8
pyasn1-modules                     0.2.8
pycodestyle                        2.4.0
pycosat                            0.6.3
pycparser                          2.18
pycrypto                           2.6.1
pycurl                             7.43.0.5
PyDispatcher                       2.0.5
pyflakes                           2.0.0
Pygments                           2.2.0
PyHamcrest                         2.0.2
pyLDAvis                           2.1.2
pylint                             2.1.1
pymongo                            3.11.0
PyMouse                            1.0
PyMySQL                            0.10.0
pyodbc                             4.0.24
pyOpenSSL                          19.1.0
pyparsing                          2.2.0
PySocks                            1.6.8
pytest                             3.8.0
pytest-arraydiff                   0.2
pytest-astropy                     0.4.0
pytest-doctestplus                 0.1.3
pytest-openfiles                   0.3.0
pytest-remotedata                  0.3.0
pytest-runner                      5.2
python-dateutil                    2.7.3
python-docx                        0.8.10
pytorch-pretrained-bert            0.6.2
pytz                               2020.4
PyWavelets                         1.0.0
pywin32                            223
pywinpty                           0.5.4
PyYAML                             5.3.1
pyzmq                              17.1.2
qt5reactor                         0.6.3
QtAwesome                          0.4.4
qtconsole                          4.4.1
QtPy                               1.5.0
queuelib                           1.5.0
redis                              3.5.3
regex                              2020.7.14
requests                           2.25.0
requests-oauthlib                  1.3.0
rope                               0.11.0
rsa                                4.6
ruamel-yaml                        0.15.46
s3transfer                         0.2.1
scikit-image                       0.14.0
scikit-learn                       0.19.2
scipy                              1.4.1
Scrapy                             1.6.0
seaborn                            0.9.0
selenium                           3.141.0
Send2Trash                         1.5.0
service-identity                   17.0.0
setuptools                         50.3.2
simplegeneric                      0.8.1
singledispatch                     3.4.0.3
six                                1.15.0
smart-open                         2.1.1
snowballstemmer                    1.2.1
snownlp                            0.12.3
sortedcollections                  1.0.1
sortedcontainers                   2.0.5
Sphinx                             1.7.9
sphinxcontrib-websupport           1.1.0
spyder                             3.3.1
spyder-kernels                     0.2.6
SQLAlchemy                         1.2.11
statsmodels                        0.9.0
stop-words                         2018.7.23
sympy                              1.1.1
tables                             3.4.4
tblib                              1.3.2
tensorboard                        1.14.0
tensorboard-plugin-wit             1.7.0
tensorflow-estimator               1.14.0
tensorflow-gpu                     1.14.0
termcolor                          1.1.0
terminado                          0.8.1
testpath                           0.3.1
text-unidecode                     1.3
toolz                              0.9.0
torch                              1.6.0+cu101
torchvision                        0.7.0+cu101
tornado                            5.1
tqdm                               4.26.0
traitlets                          4.3.2
Twisted                            18.7.0
typing-extensions                  3.7.4.3
typing-inspect                     0.6.0
unicodecsv                         0.14.1
urllib3                            1.25.10
w3lib                              1.21.0
wcwidth                            0.1.7
webencodings                       0.5.1
Werkzeug                           0.14.1
wheel                              0.31.1
widgetsnbextension                 3.4.1
win-inet-pton                      1.0.1
win-unicode-console                0.5
wincertstore                       0.2
wrapt                              1.12.1
xlrd                               1.1.0
XlsxWriter                         1.1.0
xlwings                            0.11.8
xlwt                               1.3.0
zict                               0.1.3
zipp                               3.1.0
zope.interface                     4.5.0

Setting iterations=500 gives the same result.

maximtrp commented on September 27, 2024

I have managed to reproduce this bug under macOS and Windows, but the model is fitted correctly under Linux. I will try to figure out the cause.

JennieGerhardt commented on September 27, 2024

My OS is Win10.
I created a new virtual environment and updated the packages, but it still doesn't work.

maximtrp commented on September 27, 2024

Thank you again for drawing attention to this problem! It is now fixed, and the new release is available on PyPI.
