
denstream's Introduction

Hello there 👋 I'm Issa

LinkedIn · Medium · Email

About me 🚀

  • 🎓 I am a Machine Learning Lead @Luko
  • ❤️ I am passionate about Software Engineering, Machine Learning/Deep Learning, Computer Vision and Music
  • 📜 BSc Informatics and MSc Artificial Intelligence
  • 📫 How to reach me: [email protected]
  • 🏠 Paris (🇫🇷)

Issa's github stats

denstream's People

Contributors

issamemari


denstream's Issues

getting empty p_micro_cluster_centers when trying the algorithm with fewer n_samples (10, 50, 100, 300 ...)

As the title says, I'm struggling to figure out **why I am getting empty p_micro_cluster_centers when running Test.py with n_samples set to a smaller number**. Any help will be appreciated.


I don't know why, but the issue does not appear when the data set has many more elements. I tried the algorithm with 500 n_samples and it worked fine.

IMPORTANT: there seems to be a problem in the _partial_fit method when it comes to generating p_micro_clusters ... I think it has to do with the if statement, but I do not know what to do about it:

def _partial_fit(self, sample, weight):
    # Merge the new sample into the closest micro-cluster (or create a new one).
    self._merging(sample, weight)
    # Every self.tp time steps, prune both micro-cluster lists.
    if self.t % self.tp == 0:
        # Keep only potential micro-clusters whose decayed weight is still
        # at least beta * mu.
        self.p_micro_clusters = [p_micro_cluster for p_micro_cluster
                                 in self.p_micro_clusters if
                                 p_micro_cluster.weight() >= self.beta *
                                 self.mu]
        # Lower weight limit Xi for each outlier micro-cluster, based on its age.
        Xis = [((self._decay_function(self.t - o_micro_cluster.creation_time
                                      + self.tp) - 1) /
                (self._decay_function(self.tp) - 1)) for o_micro_cluster in
               self.o_micro_clusters]
        # Drop outlier micro-clusters whose weight fell below their limit.
        self.o_micro_clusters = [o_micro_cluster for Xi, o_micro_cluster in
                                 zip(Xis, self.o_micro_clusters) if
                                 o_micro_cluster.weight() >= Xi]
    self.t += 1

Thanks in advance!
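A possible clarifying note on the pruning condition above: a potential micro-cluster survives the periodic cleanup only while its decayed weight stays at or above beta * mu, and with very few samples the micro-clusters never accumulate enough weight to clear that bar. The short self-contained sketch below (plain NumPy, independent of DenStream.py; the lambd, beta and mu values are illustrative assumptions, not the repo's defaults) shows the effect:

import numpy as np

def decayed_weight(t_now, arrival_times, lambd):
    # Decayed weight of a micro-cluster that absorbed one point at each arrival
    # time, using the DenStream fading function f(t) = 2^(-lambda * t).
    arrival_times = np.asarray(arrival_times, dtype=float)
    return np.sum(2.0 ** (-lambd * (t_now - arrival_times)))

lambd, beta, mu = 0.1, 0.5, 6.0   # illustrative values only
threshold = beta * mu             # pruning threshold used in _partial_fit

# A micro-cluster that only ever absorbed 2 points cannot exceed weight 2,
# so it is pruned whenever beta * mu > 2 and p_micro_clusters ends up empty.
print(decayed_weight(10, [8, 9], lambd), "<", threshold)

# With many points arriving steadily, the weight clears the threshold.
print(decayed_weight(20, range(20), lambd), ">", threshold)

In other words, increasing n_samples (or lowering beta and mu) is what allows potential micro-clusters, and hence p_micro_cluster_centers, to be non-empty.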

fit_predict only returns labels for newly added data, is that intended?

Hello. I can see in the code that fit_predict takes an X array of data, adds it to the micro-clusters, and then runs DBSCAN on them (the micro-clusters). However, it then returns labels only for X, not for pre-existing data.

A more concrete example of how I want to use the algorithm:

  • I have multiple days of data.
  • I run partial_fit using only the 1st day, which creates some micro-clusters.
  • Then I run partial_fit on the data of the 2nd day, which updates the existing micro-clusters.
  • And so on.

If on my final day I want to run fit_predict to get the final clustering result, I have to pass all the points (of all the days) as X, because the function only labels those points (see the sketch below). Is this how it is intended?
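For reference, a minimal sketch of the workflow described in this question, assuming the DenStream class from this repo is imported from DenStream.py and exposes partial_fit(X) / fit_predict(X) as used in Test.py (the constructor arguments below are illustrative assumptions, not documented defaults):

import numpy as np
from DenStream import DenStream  # assumed import path within this repo

rng = np.random.default_rng(0)
# Three "days" of 2-D data, each day drawn around a different location.
days = [rng.normal(loc=i, scale=0.3, size=(200, 2)) for i in range(3)]

clusterer = DenStream(lambd=0.1, eps=0.5, beta=0.5, mu=3)  # illustrative parameters

# Online phase: update the micro-clusters one day at a time.
for day in days:
    clusterer.partial_fit(day)

# Offline phase: because fit_predict only labels the array it is given,
# the concatenation of all days has to be passed to label every historical point.
all_points = np.vstack(days)
labels = clusterer.fit_predict(all_points)
print(len(labels))  # one label per point across all days

Note that in this sketch fit_predict also re-absorbs the concatenated points into the micro-clusters, which is part of what the question is asking about.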

Get items from each cluster

Hi, is there a way to get the elements of each cluster? For example, if a cluster is made up of a few micro-clusters and contains 10 elements, how can I get those 10 elements with their X, Y coordinates?
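The implementation itself does not appear to keep the raw samples, so one workaround (a sketch under that assumption: you retain the original points yourself and label them with fit_predict on the same array) is to group the points by the returned label:

import numpy as np
from collections import defaultdict

def group_points_by_label(points, labels):
    # Return {cluster_label: array of (X, Y) points assigned to that cluster}.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {label: np.asarray(members) for label, members in groups.items()}

# Usage, with `labels` obtained from fit_predict on the same `X`:
# clusters = group_points_by_label(X, labels)
# clusters[0]  -> the X, Y coordinates of every element in cluster 0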

fixed eps for dbscan when clustering micro clusters?

Hi! I was reading the code and came across this:

file DenStream.py, line 130: dbscan = DBSCAN(eps=0.3, algorithm='brute')

I'm trying to understand the algorithm so here are my questions:

  1. Why set a fixed DBSCAN eps parameter? Shouldn't it vary according to each problem/data set?
  2. If fixed, why 0.3 instead of another value?

Reading the MOA (Java framework) DenStream implementation, I found the following: a constant is multiplied by the original 'epsilon' parameter to obtain the value of the DBSCAN epsilon ...


https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/clusterers/denstream/WithDBSCAN.java
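A hedged sketch of how the offline step could mirror MOA's behaviour, where the DBSCAN epsilon is the online epsilon scaled by a constant rather than a hard-coded 0.3 (the offline_factor argument and the idea of threading the stream epsilon through are assumptions for illustration, not the repo's actual API):

from sklearn.cluster import DBSCAN

def build_offline_dbscan(stream_eps, offline_factor=2.0):
    # Offline DBSCAN whose eps follows the online epsilon instead of being fixed.
    return DBSCAN(eps=offline_factor * stream_eps, algorithm='brute')

# Example: with an online epsilon of 0.15 this reproduces the current eps=0.3,
# but the value now tracks the parametrization of the stream.
dbscan = build_offline_dbscan(stream_eps=0.15)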

Is there a way to not assign every point to a cluster?

In fit_predict I see that labels are returned. Each point is assigned to the nearest micro-cluster, and hence every point gets a label. Yet this methodology uses DBSCAN, which allows points to remain unassigned. Is there a way to adapt this code so that some points are left unlabeled (e.g. -1) when they are outliers? In practice, I see points that are too far from the centers and could not reasonably belong to any other micro-cluster either; these should remain unassigned.
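One possible adaptation, sketched under assumptions: after the nearest-micro-cluster assignment, mark a point as noise (-1) when it is farther than a chosen radius from every potential micro-cluster centre. Here `centers` is assumed to be an array of p-micro-cluster centres and `max_radius` a user-chosen cutoff; neither is an existing parameter of this repo.

import numpy as np

def label_with_noise(points, centers, labels, max_radius):
    # Return a copy of `labels` where points too far from every centre become -1.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Distance from every point to every centre, shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    noisy = np.asarray(labels).copy()
    noisy[nearest > max_radius] = -1
    return noisy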
