lmcinnes / enstop
Ensemble topic modelling with pLSA
License: BSD 2-Clause "Simplified" License
When using model.transform() on new, unseen data, the following error occurs:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-42-30746b9aeea8> in <module>
1 test_corpus = df1['cleaned_text'].tolist()
2 test_dtm = vectorizer.transform(test_corpus)
----> 3 test_doc_vecs = model.transform(test_dtm)
4 labels = np.argmax(test_doc_vecs, axis=1)
5
/opt/conda/lib/python3.7/site-packages/enstop/enstop_.py in transform(self, X, y)
836 n_iter_per_test=5,
837 tolerance=0.001,
--> 838 random_state=random_state,
839 )
840
TypeError: plsa_refit() missing 1 required positional argument: 'sample_weight'
There seems to be a missing arg here.
Seems a simple fix - I would be happy to make a PR, but I am not sure how to derive the needed arg:
sample_weight: array of shape (n_docs,)
Input document weights.
@lmcinnes, if you can shed some light here, this could be a quick fix!
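One possible way to build the missing argument, assuming every unseen document should count equally (an assumption; the maintainer has not confirmed this), is a vector of ones with one entry per row of the document-term matrix. The helper name uniform_sample_weight is hypothetical and is not part of enstop, and the float32 default is a guess at what the library expects.

```python
import numpy as np


def uniform_sample_weight(X, dtype=np.float32):
    """Return an array of shape (n_docs,) that weights every document equally.

    Hypothetical helper, not part of enstop. X can be a dense array or a
    scipy sparse matrix; only X.shape[0] (the number of documents) is used.
    """
    n_docs = X.shape[0]
    return np.ones(n_docs, dtype=dtype)


# Toy stand-in for a document-term matrix with 4 documents and 7 terms.
toy_dtm = np.zeros((4, 7))
weights = uniform_sample_weight(toy_dtm)
```

Whether uniform weights match what fit() uses internally would need to be checked against the enstop source before putting this in a PR.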
When I am running the following code:
ens_model = EnsembleTopics(n_components=20, n_starts=8, n_jobs=2).fit(data_vec)
I get the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit(self, X, y)
719 self
720 """
--> 721 self.fit_transform(X)
722 return self
723
d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit_transform(self, X, y)
763 self.alpha,
764 self.solver,
--> 765 self.random_state,
766 )
767 self.components_ = V
d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_fit(X, estimated_n_topics, model, init, min_samples, min_cluster_size, n_starts, n_jobs, parallelism, topic_combination, n_iter, n_iter_per_test, tolerance, e_step_thresh, lift_factor, beta_loss, alpha, solver, random_state)
507 alpha=alpha,
508 solver=solver,
--> 509 random_state=random_state,
510 )
511
d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_of_topics(X, k, model, n_jobs, n_runs, parallelism, **kwargs)
181
182 if parallelism == "dask":
--> 183 dask_topics = dask.delayed(create_topics)
184 staged_topics = [dask_topics(X, k, **kwargs) for i in range(n_runs)]
185 topics = dask.compute(*staged_topics, scheduler="threads", num_workers=n_jobs)
AttributeError: module 'dask' has no attribute 'delayed'
data_vec is the vectorized corpus:
data_vec = CountVectorizer().fit_transform(data)
I cannot run any version of EnsembleTopics.
Could you please help?
I am using Python 3.7.5 x64 on Windows 10.
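A quick way to check whether the installed dask exposes dask.delayed at all is sketched below. The function name dask_delayed_available is made up for illustration. As far as I know, older dask releases only exposed delayed when its optional dependencies could be imported, and installing the dask[delayed] extra pulls them in; that is an assumption about this environment, not a confirmed diagnosis.

```python
import importlib


def dask_delayed_available():
    """Return True if dask is importable and exposes the `delayed` attribute.

    Hypothetical diagnostic helper; it only inspects the installed package
    and does not change anything.
    """
    try:
        dask = importlib.import_module("dask")
    except ImportError:
        # dask is not installed at all.
        return False
    return hasattr(dask, "delayed")


available = dask_delayed_available()
print("dask.delayed available:", available)
```

If this prints False while dask itself imports fine, reinstalling with the delayed extra would be the first thing to try.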
The code from your homepage
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import EnsembleTopics
news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)
model = EnsembleTopics(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_
results in an error:
File hdbscan\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()
File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()
TypeError: 'numpy.float64' object cannot be interpreted as an integer
I have sklearn 1.3.0, Python 3.11.4
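The same class of error can be reproduced outside hdbscan. This is only an illustration of what the message means (a NumPy float reaching code that needs an integer), not a diagnosis of the compiled hdbscan code in the traceback.

```python
import numpy as np

n = np.float64(3.0)

# range() needs an integer, and a NumPy float is not accepted in its place.
try:
    range(n)
    message = None
except TypeError as exc:
    message = str(exc)

# An explicit cast is the usual remedy once the offending value is located.
as_int = int(n)
indices = list(range(as_int))
```

Because the failing call is inside hdbscan itself, the cast cannot be applied from user code; trying a different hdbscan or numpy version is a guess at a workaround, not a verified fix.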
PLSA and other methods give a strange coherence score:
PLSA(n_components=3).fit(data_vec).coherence()
PLSA(n_components=4).fit(data_vec).coherence()
n=5, -894.0931521853117
n=4, -846.5056881515624
n=1000, -548.1772075123278
When I use gensim, I get quite a good score:
n     coherence
2     0.4492
3     0.4257
4     0.4308
5     0.4443
6     0.4625
7     0.455
8     0.4791
9     0.4897
10    0.5354
11    0.5165
12    0.5149
13    0.5382
14    0.5546
15    0.5669
16    0.5633
17    0.5323
Could you please tell me whether there is a bug?
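The two sets of numbers may simply be on different scales. Some coherence measures are sums of log ratios and are negative by construction, while others are normalized to a bounded range. The sketch below computes a UMass-style score on a toy matrix to show where negative values come from. It is not enstop's coherence() implementation, and whether enstop uses a measure of this kind is an assumption that was not checked against its source; the function name umass_coherence is made up for illustration.

```python
import numpy as np


def umass_coherence(doc_term, top_words, eps=1.0):
    """UMass-style coherence for one topic.

    Sums log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs of top words,
    where D counts the documents that contain the word(s).
    """
    present = np.asarray(doc_term) > 0
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            co_docs = np.sum(present[:, w_i] & present[:, w_j])
            docs_j = np.sum(present[:, w_j])
            score += np.log((co_docs + eps) / docs_j)
    return score


# Toy document-term matrix: 4 documents, 3 terms.
toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
])
score = umass_coherence(toy, top_words=[0, 1, 2])
```

On this toy data the score is negative because the words co-occur in fewer documents than they appear in individually. Scores like this cannot be compared directly with values in the 0 to 1 range.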
I can't run the NMF algorithm. When I run:
%%time
nmf_model = NMF(n_components=20, beta_loss='kullback-leibler', solver='mu').fit(data)
I see the following stack trace:
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
<timed exec> in <module>
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit(self, X, y, **params)
1310 self
1311 """
-> 1312 self.fit_transform(X, **params)
1313 return self
1314
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit_transform(self, X, y, W, H)
1285 l1_ratio=self.l1_ratio, regularization='both',
1286 random_state=self.random_state, verbose=self.verbose,
-> 1287 shuffle=self.shuffle)
1288
1289 self.reconstruction_err_ = _beta_divergence(X, W, H, self.beta_loss,
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle)
1067 tol, l1_reg_W, l1_reg_H,
1068 l2_reg_W, l2_reg_H, update_H,
-> 1069 verbose)
1070
1071 else:
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _fit_multiplicative_update(X, W, H, beta_loss, max_iter, tol, l1_reg_W, l1_reg_H, l2_reg_W, l2_reg_H, update_H, verbose)
810 if update_H:
811 delta_H = _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H,
--> 812 l2_reg_H, gamma)
813 H *= delta_H
814
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H, l2_reg_H, gamma)
634 else:
635 # Numerator
--> 636 WH_safe_X = _special_sparse_dot(W, H, X)
637 if sp.issparse(X):
638 WH_safe_X_data = WH_safe_X.data
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _special_sparse_dot(W, H, X)
178 batch = slice(start, start + batch_size)
179 dot_vals[batch] = np.multiply(W[ii[batch], :],
--> 180 H.T[jj[batch], :]).sum(axis=1)
181
182 WH = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)
FloatingPointError: underflow encountered in multiply
I also have the same error for LatentDirichletAllocation if I choose 448 clusters for 25000 rows:
%%time
lda_model = LatentDirichletAllocation(n_components=448).fit(data_vec)
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
<timed exec> in <module>
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in fit(self, X, y)
566 # batch update
567 self._em_step(X, total_samples=n_samples,
--> 568 batch_update=True, parallel=parallel)
569
570 # check perplexity
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _em_step(self, X, total_samples, batch_update, parallel)
446 # E-step
447 _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 448 parallel=parallel)
449
450 # M-step
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _e_step(self, X, cal_sstats, random_init, parallel)
399 self.mean_change_tol, cal_sstats,
400 random_state)
--> 401 for idx_slice in gen_even_slices(X.shape[0], n_jobs))
402
403 # merge result
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1001 # remaining jobs.
1002 self._iterating = False
-> 1003 if self.dispatch_one_batch(iterator):
1004 self._iterating = self._original_iterator is not None
1005
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
832 return False
833 else:
--> 834 self._dispatch(tasks)
835 return True
836
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
751 with self._lock:
752 job_idx = len(self._jobs)
--> 753 job = self._backend.apply_async(batch, callback=cb)
754 # A job can complete so quickly than its callback is
755 # called before we get here, causing self._jobs to
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
199 def apply_async(self, func, callback=None):
200 """Schedule a func to be run"""
--> 201 result = ImmediateResult(func)
202 if callback:
203 callback(result)
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
580 # Don't delay the application, to avoid keeping the input
581 # arguments in memory
--> 582 self.results = batch()
583
584 def get(self):
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _update_doc_distribution(X, exp_topic_word_distr, doc_topic_prior, max_iters, mean_change_tol, cal_sstats, random_state)
115
116 doc_topic_d = (exp_doc_topic_d *
--> 117 np.dot(cnts / norm_phi, exp_topic_word_d.T))
118 # Note: adds doc_topic_prior to doc_topic_d, in-place.
119 _dirichlet_expectation_1d(doc_topic_d, doc_topic_prior,
FloatingPointError: underflow encountered in multiply
Could you please help?
I am using Python 3.7.5 x64 on Windows 10.
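Both tracebacks end in FloatingPointError, which NumPy only raises when its floating-point error handling is set to "raise"; by default underflow is ignored. One guess, not verified for this environment, is that something in the session called np.seterr with a raise setting. The sketch below shows the behaviour difference on a toy multiplication using np.errstate.

```python
import numpy as np

tiny = np.array([1e-200])

# With underflow set to "raise", the product (1e-400, below the float64
# range) triggers the same exception type as in the tracebacks.
with np.errstate(under="raise"):
    try:
        tiny * tiny
        raised = False
    except FloatingPointError:
        raised = True

# With underflow ignored, the same product quietly flushes to zero.
with np.errstate(under="ignore"):
    flushed = tiny * tiny
```

If that guess holds, wrapping the fit call in `with np.errstate(under="ignore"):` might let it run; printing np.geterr() beforehand would show the settings currently in effect.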
Hi @lmcinnes
thanks for this nice code here...
I am looking for a solution for the following task:
I have a cluster of small texts and want to extract the common topic of them.
The "headline" above them so to say.
Do you have a hint for me on how to solve this, or what to read and where?
Thanks
Philip
Dear Leland,
I tried to use pyLDAvis with enstop, following the same API as the sklearn topic models. I did essentially what is shown here https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb and replaced scikit-learn's LatentDirichletAllocation with enstop's EnsembleTopics.
I got this error:
ValidationError:
* Not all rows (distributions) in doc_topic_dists sum to 1.
Can you help to sort it out?
Thank you in advance
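The validation message says that each row of the document-topic matrix must sum to 1. A possible workaround, assuming the matrix coming out of the model is non-negative but not normalized (an assumption about enstop's output), is to rescale each row before handing it to pyLDAvis. The helper name normalize_rows is made up for illustration.

```python
import numpy as np


def normalize_rows(doc_topic):
    """Rescale each row of a non-negative matrix so that it sums to 1.

    Rows that are entirely zero are left as zeros, so they would still
    fail a sums-to-1 check and need separate handling.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    row_sums = doc_topic.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0  # avoid division by zero
    return doc_topic / row_sums


# Toy document-topic weights for 3 documents and 2 topics.
toy = np.array([[2.0, 2.0], [1.0, 3.0], [0.0, 0.0]])
normalized = normalize_rows(toy)
```

Whether rescaling keeps the rows meaningful as topic distributions depends on how enstop defines its document vectors, so the result should be checked before relying on the visualization.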