
derrickburns / generalized-kmeans-clustering

297 stars · 49 forks · 7.6 MB

Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
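To make "generalized K-Means with Bregman divergences" concrete, here is a plain-Scala sketch (not the library's API) of two Bregman divergences and the nearest-center assignment step that generalized k-means repeats: squared Euclidean distance arises from F(x) = ||x||², and the generalized KL divergence from F(x) = Σ xᵢ log xᵢ, which is what makes the method suitable for probabilistic data.

```scala
// Illustrative only; the library's real classes live under com.massivedatascience.
object BregmanSketch {
  type Vec = Array[Double]

  // Bregman divergence of F(x) = ||x||^2: the usual squared Euclidean distance.
  def squaredEuclidean(x: Vec, y: Vec): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

  // Bregman divergence of F(x) = sum_i x_i log x_i: generalized KL divergence,
  // defined for vectors with positive entries (e.g. probabilities, counts).
  def klDivergence(p: Vec, q: Vec): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) - a + b }.sum

  // The assignment step of k-means: pick the center minimizing the divergence.
  def assign(point: Vec, centers: Seq[Vec], d: (Vec, Vec) => Double): Int =
    centers.indices.minBy(i => d(point, centers(i)))
}
```

Swapping `d` between `squaredEuclidean` and `klDivergence` is the essence of the generalization: the same assign/update loop runs, only the distortion measure changes.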

Home Page: https://generalized-kmeans-clustering.massivedatascience.com/

License: Apache License 2.0

Languages: Scala 4.62% · HTML 93.40% · CSS 0.96% · JavaScript 1.03%
Topics: bregman-divergence, clustering, cosine-similarity, embeddings, entropy, euclidean-distance, itakura-saito-divergence, k-means, kullback-leibler-divergence, similarity-search, spark, spark-mllib

generalized-kmeans-clustering's People

Contributors

derrickburns · gitter-badger · jpreiss · waffle-iron


generalized-kmeans-clustering's Issues

Publish on Spark Packages Main Repository

Hi @derrickburns,

Would you like to release this package on the Spark Packages Maven repo? There is an sbt plugin called sbt-spark-package that helps you make the release straight from your sbt console. All you need to do is set a couple of configurations.

Publishing on the Spark Packages repo will bump your ranking on the website and will fill in the How To section, which users can follow to include your package in their work.
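A minimal build.sbt sketch of the "couple of configurations" mentioned above, assuming the sbt-spark-package plugin is on the build; the Spark version and component values here are illustrative assumptions and should be checked against the plugin's documentation:

```scala
// Hypothetical build.sbt fragment for sbt-spark-package (values are assumptions).
spName := "derrickburns/generalized-kmeans-clustering"  // "org/repo" name on Spark Packages

sparkVersion := "1.4.0"        // the Spark version the library is built against (assumed)

sparkComponents += "mllib"     // declares a dependency on spark-mllib
```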

Please let me know if you have any comments/questions/suggestions!

Best,
Burak

StackOverflowError when predicting single vectors

Hi,
When predicting a single Vector from an RDD[Vector] on a trained model, a StackOverflowError is thrown.
Doing the same on the whole RDD[Vector] at once works fine.

println("clustering single vectors fails")
val singleVector = mymatrix.map { point =>
  try {
    val prediction = kModel.predict(point)
    (point.toString, prediction)
  } catch {
    case e: StackOverflowError =>
      println("unable to predict a single vector")
      (point.toString, -1) // keep both branches the same type
  }
}
println(s"singleVector.count(): ${singleVector.count()}")

println("clustering using multiple vectors; this runs fine")
val predictions = kModel.predict(mymatrix)
val multipleVector = predictions.zip(mymatrix).map { case (prediction, point) =>
  (point.toString, prediction)
}
println(s"multipleVector.count(): ${multipleVector.count()}")

I've put my code with data as an example here: https://github.com/bkersbergen/massive-kmeans-overflow.

2015/06/18 11:10:03:300 [ERROR] [Executor task launch worker-5] org.apache.spark.Logging$class.logError:96 - Exception in task 0.0 in stage 63.0 (TID 31500)
java.lang.StackOverflowError
    at com.massivedatascience.divergence.SquaredEuclideanDistanceDivergence$.convexHomogeneous(BregmanDivergence.scala:144)
    at com.massivedatascience.clusterer.NonSmoothedPointCenterFactory$class.toPoint(BregmanPointOps.scala:209)
    at com.massivedatascience.clusterer.SquaredEuclideanPointOps$.toPoint(BregmanPointOps.scala:260)
    at com.massivedatascience.clusterer.KMeansPredictor$class.predictWeighted(KMeansModel.scala:66)
    at com.massivedatascience.clusterer.KMeansModel.predictWeighted(KMeansModel.scala:99)

This works with the MLlib k-means implementation; switching to massive-kmeans produces the StackOverflowError above.
(You can switch between the MLlib and massivedatascience import statements in the Scala file to see the difference.)

Rationale behind WeightedVector

I don't get the rationale behind WeightedVector. As far as I can tell, WeightedVector applies the same weight to every element of the Vector. For example, if I have

val v  = Vectors.dense(1,0.5,3)
val wv = WeightedVector(v,0.5)

wv will be treated as Vectors.dense(0.5, 0.25, 1.5) for clustering purposes, right?
Now, say I'm extracting two features from my data: one feature is represented by a single vector element and the other by 20 vector elements. If I want both features to carry the same weight in clustering, I should weight the first element by 1 and each of the other 20 by 1/20, right?
That is the kind of functionality I expected from WeightedVector. I don't see the point of WeightedVector as it is now, but that is probably due to my lack of experience with clustering and data mining in general.
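The per-feature-group weighting asked about above can be done manually before clustering, independently of WeightedVector's scalar weight. A plain-Scala sketch (names are illustrative, not part of the library): under squared Euclidean distance, scaling element i by sqrt(w_i) weights its squared contribution by w_i, so giving each element of a group the weight 1/groupSize makes the groups contribute equally in total.

```scala
object FeatureWeighting {
  // Scale each element by the square root of its own weight, so that the
  // squared-Euclidean contribution of element i is multiplied by weights(i).
  def scaleElements(v: Array[Double], weights: Array[Double]): Array[Double] = {
    require(v.length == weights.length, "one weight per element")
    v.zip(weights).map { case (x, w) => x * math.sqrt(w) }
  }

  // Equal total weight for two feature groups of sizes n1 and n2:
  // each element of a group gets weight 1 / groupSize.
  def groupWeights(n1: Int, n2: Int): Array[Double] =
    Array.fill(n1)(1.0 / n1) ++ Array.fill(n2)(1.0 / n2)
}
```

For the example above (one 1-element feature and one 20-element feature), `groupWeights(1, 20)` yields weight 1 for the first element and 1/20 for each of the other 20, which is exactly the behavior the question asks for.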
