
Only a single CPU core is used · lmcinnes/hdbscan issue #49 (CLOSED, 12 comments)

Dringite commented on May 17, 2024

Not sure if this is a bug or feature, but I have observed that on my Ubuntu 14.04 machine HDBSCAN only ever uses one core; some other cores also spike occasionally, but 90% of the time it's just a single core at 100% and all the others at 0%.

Is this algorithm not parallelizable? Or has it not been done yet?


Comments (12)

lmcinnes commented on May 17, 2024

The algorithm is not easily parallelizable, so for the most part it will
indeed only use one core. Where easy parallelism was available I have made
use of it, hence the apparent use of other cores. The majority of the work,
however, is quite hard to parallelise in any reasonable way, and my focus
was on data structures and algorithms that could significantly improve
single core performance. There is some hope that Dask and a suitably
dask-enabled space tree might enable some better parallel performance, but
that is some ways off from being developed.
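One parallel knob the library does expose is the core-distance (k-nearest-neighbour) computation. A minimal sketch, assuming a version of hdbscan recent enough to have the core_dist_n_jobs parameter (the data here is a placeholder):

```python
import numpy as np
import hdbscan

data = np.random.rand(10000, 20)  # placeholder for a real dataset

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=32,
    core_dist_n_jobs=-1,  # use all cores for the core-distance (k-NN) step
)
labels = clusterer.fit_predict(data)
```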


Dringite commented on May 17, 2024

I also noticed that by simply shuffling the dataset I'm able to get different cluster counts. So what about introducing an option to train an ensemble (multiple instances) in parallel? You can of course do this yourself with Python tricks, but those are never pleasant to orchestrate.
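A minimal sketch of that kind of ensemble orchestration with joblib; nothing here is an hdbscan feature, and the fit_shuffled helper, seeds, and parameter values are made up for the example:

```python
import numpy as np
import hdbscan
from joblib import Parallel, delayed

def fit_shuffled(data, seed):
    """Fit HDBSCAN on a shuffled copy of the data; return the cluster count."""
    rng = np.random.RandomState(seed)
    labels = hdbscan.HDBSCAN(min_cluster_size=32).fit_predict(
        data[rng.permutation(len(data))]
    )
    return len(set(labels)) - (1 if -1 in labels else 0)  # exclude noise (-1)

data = np.random.rand(10000, 20)  # placeholder for a real dataset

# Fit 8 shuffled instances on 4 cores; the spread of the resulting counts
# exposes how sensitive the clustering is to data ordering.
counts = Parallel(n_jobs=4)(delayed(fit_shuffled)(data, s) for s in range(8))
print(counts)
```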


lmcinnes commented on May 17, 2024

Hmmm, that's actually slightly less than ideal. Can you set approx_min_spanning_tree=False and see if that remedies the variability upon shuffling? Also, how much variability are you seeing? A little bit, or large swings in the number and nature of clusters?
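Concretely, the flag being suggested (all other values here are illustrative):

```python
import numpy as np
import hdbscan

data = np.random.rand(10000, 20)  # placeholder for a real dataset

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=32,
    # Build the exact minimum spanning tree; slower, but removes the
    # approximation as a source of order-dependent variability.
    approx_min_spanning_tree=False,
)
labels = clusterer.fit_predict(data)
```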


Dringite commented on May 17, 2024

Now that I have started to explore my dataset in more depth using this tool, I have more problems to deal with. I have a very noisy dataset with 20 dimensions and 100k samples.

HDBSCAN(min_cluster_size=32, min_samples=1, alpha=1.0)

produces only 4 clusters, 3 of them containing almost nothing, with about 15k samples classed as noise. I guess I will have to go and play with the alpha parameter now.

This is somewhat surprising; I was expecting great things from HDBSCAN after the 2D demos, but so far it looks like it has serious problems detecting structure in just 20 dimensions.

Prior to this I used self-organizing maps, and I can tell you that it had no problem detecting a lot of structure in this dataset; it was just very slow.

Edit: I think the noise must be providing a lot of bridges; it's a problem DBSCAN had. Is there a parameter to try and solve that?

Edit 2: alpha really doesn't make any difference at all; same results with 1, 1.5, 2, 4, 8 and 1024.


lmcinnes commented on May 17, 2024

I actually doubt alpha will give you much. A more useful approach may be to use some dimension reduction (t-SNE is good) down to 2-10 dimensions and see if that helps at all, as that will be the kind of structure the self-organizing map was pulling out. I would actually argue that perhaps the data is less structured than you thought -- at least as far as clusters go (there may be geometric/topological structure beyond pi_0, i.e. connected components, that is of interest, for example). I would also suggest that you try increasing min_samples. More of your data will be classified as noise, but you will be more likely to find any finer structure in clusters.
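A minimal sketch of the reduce-then-cluster pipeline being suggested, using sklearn's t-SNE; parameter values are illustrative, and for 100k samples you would likely want to subsample first, as t-SNE is expensive:

```python
import numpy as np
import hdbscan
from sklearn.manifold import TSNE

data = np.random.rand(5000, 20)  # placeholder for a (subsampled) dataset

# Project to 2 dimensions to bring out any manifold structure...
embedding = TSNE(n_components=2).fit_transform(data)

# ...then cluster the embedding rather than the raw 20-d data.
clusterer = hdbscan.HDBSCAN(min_cluster_size=32, min_samples=10)
labels = clusterer.fit_predict(embedding)
```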


lmcinnes commented on May 17, 2024

I would also point out that I have used HDBSCAN quite successfully on 50-100 dimensional data, so it is not simply the case that it doesn't work in higher dimensions. I would be interested in your dataset if you can share it (I understand that is not possible in many situations).


Dringite commented on May 17, 2024

The dataset is unfortunately confidential. Every dimension has been normalized to 0..1 with mean 0.5 ± 0.02.

I'm curious about your comment on SOM; what is your opinion of it in general? I have studied its inner workings, but I don't have the mathematical grounding to understand why it works. It uses a distance function to establish nodes on a lattice that share similarity with the data, but I don't know why it works so much better than every other algorithm -- in this case even HDBSCAN.

I followed your suggestion, and it really doesn't matter what parameters I choose: if min_samples is low then over 90% of the data is noise, and if it's high then over 90% is in one cluster. Also, the probabilities are always near 100% or 0% (for the noise). Perhaps what you said about the data not having clusters is true, but SOM identifies at least 7 hot spots that contain significant fractions of the dataset. I think that HDBSCAN is unable to deal with bridges and just follows the sparse points, merging the hot spots into one cluster. Is it not capable of classifying the bridges (which are made of sparse points) as noise?


lmcinnes commented on May 17, 2024

I understand with regard to the data; that is often the case.

So it is possible that there are clusters and HDBSCAN doesn't find them. My
goal in clustering is "Don't be wrong" -- don't claim there are clusters
when there aren't, because that can lead to bad intuitions about the data.
That means that I like somewhat conservative algorithms. That being said,
HDBSCAN can deal with bridges -- a higher setting of min_samples will break
bridges, unless the bridges are just as dense as the cluster they are
bridging (at which point I am reluctant to call them separate clusters).

The goal with t-SNE is that, if there is some underlying manifold structure to your data, you can bring out that structure more clearly (in a lower dimensional space), and this can aid HDBSCAN in identifying the clusters. Since SOM is essentially doing a dimension reduction and you are seeing structure there, it may be the case that manifold structure exists and is important. I would trust t-SNE more than SOM with regard to such a reduction (although you may want to check the final KL divergence you get with t-SNE just to be sure).

I tend to view SOM as a dimension reduction technique, and as with many such techniques it is not especially conservative (SOM has a slight bias toward clumpiness, as PCA has a bias toward fan-like structures, and Isomap toward loop-like manifold structures). I would want to verify the results it gives carefully. It may be worth experimenting with other dimension reduction techniques to see if they reveal similar patterns (I would recommend trying RobustPCA, Isomap, LLE, and t-SNE).
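A minimal sketch of that comparison with sklearn; RobustPCA has no scikit-learn implementation, so plain PCA stands in for it here, and everything else is as named in the comment above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

data = np.random.rand(2000, 20)  # placeholder for a (subsampled) dataset

reducers = {
    "PCA": PCA(n_components=2),  # stand-in for RobustPCA
    "Isomap": Isomap(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2),
    "t-SNE": TSNE(n_components=2),
}
embeddings = {name: r.fit_transform(data) for name, r in reducers.items()}

# The final KL divergence mentioned above, as a sanity check on the t-SNE fit:
print("t-SNE KL divergence:", reducers["t-SNE"].kl_divergence_)
```

If several of these reveal the same seven-ish hot spots the SOM found, that is decent evidence the structure is real rather than an artifact of one technique.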


Dringite commented on May 17, 2024

I started getting some results with min_cluster_size=8, min_samples=1: over 1000 clusters with a surprisingly even distribution (noise being the biggest, but that is expected). But then as soon as I changed min_samples to 2 it went back to 1 big cluster; it is not very intuitive why this is happening... I think I will just end up doing a grid search to maximize the evenness of the cluster-size histogram.

It is remarkably difficult to find a good implementation of t-SNE with Python bindings, wow. The sklearn implementation hits a MemoryError with 16 GB of RAM on a mere (35000, 20) array -- how is that even useful? It seems like it was designed solely for the Iris dataset. The pip tsne implementation works fine but does not let you vary the number of iterations, so it ends with a high error of 4.4 and a big blob result despite the error being on a decreasing trajectory. I would not have expected such a mess for what's touted to be a rather popular algorithm, t-SNE that is.
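A minimal sketch of the grid search described above; the evenness score (entropy of the cluster-size histogram) is one made-up way to quantify "even distribution", not anything from the thread:

```python
import numpy as np
import hdbscan

def evenness(labels):
    """Entropy of the cluster-size distribution; higher = more even. Noise (-1) is ignored."""
    sizes = np.bincount(labels[labels >= 0])
    sizes = sizes[sizes > 0]
    if len(sizes) < 2:
        return 0.0  # zero or one cluster: nothing to spread over
    p = sizes / sizes.sum()
    return float(-(p * np.log(p)).sum())

data = np.random.rand(10000, 20)  # placeholder for a real dataset

best = None
for mcs in (8, 16, 32, 64):
    for ms in (1, 2, 4, 8):
        labels = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(data)
        score = evenness(labels)
        if best is None or score > best[0]:
            best = (score, mcs, ms)

print("best (score, min_cluster_size, min_samples):", best)
```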


lmcinnes commented on May 17, 2024

The package tsne is the reference package (in C++ with bindings to Python
and R) from the original authors; it ought to be about as good as things
come. I agree that the sklearn implementation currently leaves a little to
be desired.

A quick look at the source of the tsne package shows me that you can simply
alter the max_iter value in tsne/bh_tsne_src/bh_tsne.cpp, line 40 and
recompile. Not exactly ideal I must admit, but if you really want more
iterations that is one way.

As to intuition on the parameters: as you increase min_samples you raise
the bar on how dense connecting bridges have to be. I would suggest a large
min_samples value and then possibly look at the condensed tree plot if you
are only seeing a small number of clusters and see how and where your
cluster structure is breaking down. Ostensibly if there are few clusters it
is because most clusters don't persist for much of a range of distance
scales.
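The condensed tree inspection looks roughly like this (condensed_tree_ and its plot method are part of the hdbscan API; parameter values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import hdbscan

data = np.random.rand(10000, 20)  # placeholder for a real dataset

clusterer = hdbscan.HDBSCAN(min_cluster_size=32, min_samples=10).fit(data)

# Plot the condensed tree; select_clusters=True circles the clusters the
# stability optimisation actually chose, showing where structure breaks down.
clusterer.condensed_tree_.plot(select_clusters=True)
plt.show()
```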


Dringite commented on May 17, 2024

Thanks for the suggestion; I will just fork it, make the changes, and then install it with a pip git URL. Lazy paths are optimal ;)

As for min_samples you are of course correct, but then how can increasing that value actually decrease the number of clusters? In theory fewer bridges means more clusters, and a lower min_samples means more bridges -- and yet the lower value gives me more clusters. Hence why I said it's somewhat unintuitive.


lmcinnes commented on May 17, 2024

The number of clusters falls out of the optimization game at the end where
we attempt to find the most stable set of clusters (over varying distance
scales). A higher min_samples can cause things to break up more, but that
may, in fact, lead to clusters being much less stable over varying distance
scales, and hence the new "most stable" result is a single parent cluster
of the previous clusters which now fall apart quickly. This can be a little
unintuitive, and thus I suggest you look to the condensed tree plots when
this sort of thing happens to see what is going on. It does imply to me
that your clusters are generally not all that stable, which would explain
why HDBSCAN is mostly giving a single cluster. I suspect the condensed tree
is very fragmented.

