lmcinnes / enstop

Ensemble topic modelling with pLSA

License: BSD 2-Clause "Simplified" License

Python 91.14% Jupyter Notebook 8.86%

topic-modeling plsa dimensionality-reduction matrix-factorization

enstop's Issues

Call to plsa_refit fails due to missing sample_weight

When using model.transform() on new unseen data, the following error occurs:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-30746b9aeea8> in <module>
      1 test_corpus = df1['cleaned_text'].tolist()
      2 test_dtm = vectorizer.transform(test_corpus)
----> 3 test_doc_vecs = model.transform(test_dtm)
      4 labels = np.argmax(test_doc_vecs, axis=1)
      5 

/opt/conda/lib/python3.7/site-packages/enstop/enstop_.py in transform(self, X, y)
    836             n_iter_per_test=5,
    837             tolerance=0.001,
--> 838             random_state=random_state,
    839         )
    840 

TypeError: plsa_refit() missing 1 required positional argument: 'sample_weight'

There seems to be a missing argument in this call.

This seems like a simple fix. I would be happy to open a PR, but I am not sure how to derive the required argument:

    sample_weight: array of shape (n_docs,)
        Input document weights.

@lmcinnes, if you can shed some light here, this could be a quick fix!
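
A minimal sketch of one way to build an array of the documented shape, assuming every document should count equally (uniform weights). Whether uniform weights are what transform is meant to pass to plsa_refit is an assumption that should be checked against the enstop source; uniform_sample_weight is a hypothetical helper, not part of enstop.

```python
import numpy as np


def uniform_sample_weight(X):
    """Return one weight of 1.0 per document (row) of X.

    Hypothetical helper: assumes that leaving documents unweighted is
    equivalent to weighting every document by 1.0.
    """
    n_docs = X.shape[0]
    return np.ones(n_docs, dtype=np.float64)


# Toy document-term matrix: 3 documents, 4 terms.
test_dtm = np.array([[1, 0, 2, 0],
                     [0, 3, 0, 1],
                     [2, 2, 0, 0]])
sample_weight = uniform_sample_weight(test_dtm)
print(sample_weight.shape)
```

Only X.shape[0] is used, so the same helper also accepts the sparse matrix returned by vectorizer.transform.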

AttributeError: module 'dask' has no attribute 'delayed'

When I am running the following code:

ens_model = EnsembleTopics(n_components=20, n_starts=8, n_jobs=2).fit(data_vec)

I get the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit(self, X, y)
    719         self
    720         """
--> 721         self.fit_transform(X)
    722         return self
    723 

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in fit_transform(self, X, y)
    763             self.alpha,
    764             self.solver,
--> 765             self.random_state,
    766         )
    767         self.components_ = V

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_fit(X, estimated_n_topics, model, init, min_samples, min_cluster_size, n_starts, n_jobs, parallelism, topic_combination, n_iter, n_iter_per_test, tolerance, e_step_thresh, lift_factor, beta_loss, alpha, solver, random_state)
    507         alpha=alpha,
    508         solver=solver,
--> 509         random_state=random_state,
    510     )
    511 

d:\pycharmprojects\biclustering\venv\lib\site-packages\enstop\enstop_.py in ensemble_of_topics(X, k, model, n_jobs, n_runs, parallelism, **kwargs)
    181 
    182     if parallelism == "dask":
--> 183         dask_topics = dask.delayed(create_topics)
    184         staged_topics = [dask_topics(X, k, **kwargs) for i in range(n_runs)]
    185         topics = dask.compute(*staged_topics, scheduler="threads", num_workers=n_jobs)

AttributeError: module 'dask' has no attribute 'delayed'

data_vec is the document-term matrix produced by CountVectorizer:

data_vec = CountVectorizer().fit_transform(data)

I cannot run any version of EnsembleTopics.

Could you please help?
I am using Python 3.7.5 (x64) on Windows 10.
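
One possible cause, offered as an assumption rather than a diagnosis: dask may be installed without the optional dependencies that dask.delayed needs, in which case the attribute can be missing from the top-level module. The sketch below only checks whether a module exposes an attribute; module_has_attr is a hypothetical helper.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True if the module imports and exposes attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is not installed or failed to import.
        return False
    return hasattr(module, attr_name)


print("dask.delayed available:", module_has_attr("dask", "delayed"))
```

If the check prints False, reinstalling dask with its extras (for example pip install "dask[delayed]") may be worth trying.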

HDBSCAN error stopping EnsembleTopics

The example code from your homepage:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from enstop import EnsembleTopics

news = fetch_20newsgroups(subset='all')
data = CountVectorizer().fit_transform(news.data)

model = EnsembleTopics(n_components=20).fit(data)
topics = model.components_
doc_vectors = model.embedding_

results in an error:

File hdbscan\_hdbscan_tree.pyx:659, in hdbscan._hdbscan_tree.get_clusters()

File hdbscan\_hdbscan_tree.pyx:733, in hdbscan._hdbscan_tree.get_clusters()

TypeError: 'numpy.float64' object cannot be interpreted as an integer

I am using scikit-learn 1.3.0 and Python 3.11.4.
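
The traceback points into hdbscan's compiled code rather than enstop, so the installed hdbscan and NumPy versions are probably relevant in addition to scikit-learn and Python. A small sketch for collecting them with the standard library; the package list is an assumption about what matters here, and installed_versions is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["enstop", "hdbscan", "numpy", "scikit-learn"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```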

Get coherence score for PLSA

PLSA and the other methods give strange coherence scores:

PLSA(n_components=3).fit(data_vec).coherence()
PLSA(n_components=4).fit(data_vec).coherence()

n=5, -894.0931521853117
n=4, -846.5056881515624
n=1000, -548.1772075123278

When I use gensim, I get quite a good score:

2 & 0.4492 
3 & 0.4257 
4 & 0.4308 
5 & 0.4443 
6 & 0.4625 
7 & 0.455 
8 & 0.4791 
9 & 0.4897 
10 & 0.5354 
11 & 0.5165 
12 & 0.5149 
13 & 0.5382 
14 & 0.5546 
15 & 0.5669 
16 & 0.5633 
17 & 0.5323

Could you please tell me whether this is a bug?
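
The two sets of numbers may simply be on different scales. As a sketch, assume (this is not confirmed from the enstop source) that coherence() returns a log-based score in the style of UMass coherence, which sums logarithms of co-occurrence ratios and is therefore usually negative, whereas a measure such as gensim's c_v is reported on a roughly 0 to 1 scale. The function umass_coherence below is a hypothetical pure-Python illustration of the UMass formula, not enstop's implementation.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: list of sets of tokens, one set per document.
    """
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = sum(1 for doc in documents if w_j in doc)
            d_ij = sum(1 for doc in documents if w_i in doc and w_j in doc)
            if d_j == 0:
                # Skip words that never occur; the ratio is undefined.
                continue
            score += math.log((d_ij + 1) / d_j)
    return score


docs = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]
score = umass_coherence(["a", "b", "c"], docs)
print(score)
```

Scores of this kind are only comparable with each other, not with values from a differently defined measure.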

FloatingPointError. NMF

I can't run the NMF algorithm. When I run:

%%time
nmf_model = NMF(n_components=20, beta_loss='kullback-leibler', solver='mu').fit(data)

I see the following stack trace:

---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit(self, X, y, **params)
   1310         self
   1311         """
-> 1312         self.fit_transform(X, **params)
   1313         return self
   1314 

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in fit_transform(self, X, y, W, H)
   1285             l1_ratio=self.l1_ratio, regularization='both',
   1286             random_state=self.random_state, verbose=self.verbose,
-> 1287             shuffle=self.shuffle)
   1288 
   1289         self.reconstruction_err_ = _beta_divergence(X, W, H, self.beta_loss,

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in non_negative_factorization(X, W, H, n_components, init, update_H, solver, beta_loss, tol, max_iter, alpha, l1_ratio, regularization, random_state, verbose, shuffle)
   1067                                                   tol, l1_reg_W, l1_reg_H,
   1068                                                   l2_reg_W, l2_reg_H, update_H,
-> 1069                                                   verbose)
   1070 
   1071     else:

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _fit_multiplicative_update(X, W, H, beta_loss, max_iter, tol, l1_reg_W, l1_reg_H, l2_reg_W, l2_reg_H, update_H, verbose)
    810         if update_H:
    811             delta_H = _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H,
--> 812                                                l2_reg_H, gamma)
    813             H *= delta_H
    814 

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _multiplicative_update_h(X, W, H, beta_loss, l1_reg_H, l2_reg_H, gamma)
    634     else:
    635         # Numerator
--> 636         WH_safe_X = _special_sparse_dot(W, H, X)
    637         if sp.issparse(X):
    638             WH_safe_X_data = WH_safe_X.data

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_nmf.py in _special_sparse_dot(W, H, X)
    178             batch = slice(start, start + batch_size)
    179             dot_vals[batch] = np.multiply(W[ii[batch], :],
--> 180                                           H.T[jj[batch], :]).sum(axis=1)
    181 
    182         WH = sp.coo_matrix((dot_vals, (ii, jj)), shape=X.shape)

FloatingPointError: underflow encountered in multiply

I also get the same error with LatentDirichletAllocation when I choose 448 components for 25000 rows:

%%time
lda_model = LatentDirichletAllocation(n_components=448).fit(data_vec)

---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<timed exec> in <module>

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in fit(self, X, y)
    566                     # batch update
    567                     self._em_step(X, total_samples=n_samples,
--> 568                                   batch_update=True, parallel=parallel)
    569 
    570                 # check perplexity

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _em_step(self, X, total_samples, batch_update, parallel)
    446         # E-step
    447         _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 448                                      parallel=parallel)
    449 
    450         # M-step

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _e_step(self, X, cal_sstats, random_init, parallel)
    399                                               self.mean_change_tol, cal_sstats,
    400                                               random_state)
--> 401             for idx_slice in gen_even_slices(X.shape[0], n_jobs))
    402 
    403         # merge result

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1001             # remaining jobs.
   1002             self._iterating = False
-> 1003             if self.dispatch_one_batch(iterator):
   1004                 self._iterating = self._original_iterator is not None
   1005 

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    832                 return False
    833             else:
--> 834                 self._dispatch(tasks)
    835                 return True
    836 

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    751         with self._lock:
    752             job_idx = len(self._jobs)
--> 753             job = self._backend.apply_async(batch, callback=cb)
    754             # A job can complete so quickly than its callback is
    755             # called before we get here, causing self._jobs to

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    199     def apply_async(self, func, callback=None):
    200         """Schedule a func to be run"""
--> 201         result = ImmediateResult(func)
    202         if callback:
    203             callback(result)

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    580         # Don't delay the application, to avoid keeping the input
    581         # arguments in memory
--> 582         self.results = batch()
    583 
    584     def get(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in __call__(self)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

d:\pycharmprojects\biclustering\venv\lib\site-packages\sklearn\decomposition\_online_lda.py in _update_doc_distribution(X, exp_topic_word_distr, doc_topic_prior, max_iters, mean_change_tol, cal_sstats, random_state)
    115 
    116             doc_topic_d = (exp_doc_topic_d *
--> 117                            np.dot(cnts / norm_phi, exp_topic_word_d.T))
    118             # Note: adds doc_topic_prior to doc_topic_d, in-place.
    119             _dirichlet_expectation_1d(doc_topic_d, doc_topic_prior,

FloatingPointError: underflow encountered in multiply

Could you please help?
I am using Python 3.7.5 (x64) on Windows 10.
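
NumPy ignores floating-point underflow by default, so a FloatingPointError for underflow suggests that something in the session has changed the error state, for example through np.seterr(all='raise'). That is an assumption about this environment, not a confirmed cause. The sketch shows the behaviour on a toy array and how np.errstate can relax it locally.

```python
import numpy as np

tiny = np.array([1e-200])

# With underflow set to "raise", the product 1e-400 triggers an error.
with np.errstate(under="raise"):
    try:
        tiny * tiny
        raised = False
    except FloatingPointError:
        raised = True

# With underflow ignored, the result is silently flushed to zero.
with np.errstate(under="ignore"):
    flushed = tiny * tiny

print("raised under 'raise':", raised)
print("current error settings:", np.geterr())
```

If treating underflow as zero is acceptable, the failing fit call could be wrapped in the same with np.errstate(under="ignore") block.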

How do I extract the common topic of a cluster of texts?

Hi @lmcinnes,
thanks for this nice code.
I am looking for a solution to the following task:
I have a cluster of short texts and want to extract their common topic.
The "headline" above them, so to speak.

Do you have a hint for me on how to solve this, or on what to read and where?

Thanks
Philip

Integration with pyLDAvis

Dear Leland,
I tried to use pyLDAvis with enstop, following the same API as for the scikit-learn topic models. I essentially did what is shown here https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb and replaced scikit-learn's LatentDirichletAllocation with enstop's EnsembleTopics.
I got this error:

ValidationError: 
 * Not all rows (distributions) in doc_topic_dists sum to 1.

Can you help me sort this out?
Thank you in advance.
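
The validation message means that at least one row of the document-topic matrix does not sum to 1. A minimal sketch of row normalisation with NumPy before handing the matrix to pyLDAvis; whether model.embedding_ is the right matrix to normalise, and how all-zero rows should be treated, are assumptions. normalize_rows is a hypothetical helper.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to sum to 1.

    Assumption: rows that sum to zero are replaced by a uniform
    distribution so that every row is a valid distribution.
    """
    matrix = np.asarray(matrix, dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    uniform = np.full_like(matrix, 1.0 / matrix.shape[1])
    # Avoid dividing by zero; those rows are overwritten by `uniform`.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return np.where(row_sums == 0, uniform, matrix / safe_sums)


doc_topic = np.array([[2.0, 1.0, 1.0],
                      [0.0, 0.0, 0.0],
                      [0.5, 0.0, 0.0]])
doc_topic_dists = normalize_rows(doc_topic)
print(doc_topic_dists.sum(axis=1))
```

pyLDAvis applies the same kind of check to the topic-term matrix, so that matrix may need the same treatment.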
