
derrickburns / generalized-kmeans-clustering

297 stars · 49 forks · 7.6 MB

Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
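To make "generalized K-Means with Bregman divergences" concrete, here is a plain-Scala sketch (not the library's API) of two Bregman divergences and the nearest-center assignment step that generalized k-means repeats: squared Euclidean distance arises from F(x) = ||x||², and the generalized KL divergence from F(x) = Σ xᵢ log xᵢ, which is what makes the method suitable for probabilistic data.

```scala
// Illustrative only; the library's real classes live under com.massivedatascience.
object BregmanSketch {
  type Vec = Array[Double]

  // Bregman divergence of F(x) = ||x||^2: the usual squared Euclidean distance.
  def squaredEuclidean(x: Vec, y: Vec): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

  // Bregman divergence of F(x) = sum_i x_i log x_i: generalized KL divergence,
  // defined for vectors with positive entries (e.g. probabilities, counts).
  def klDivergence(p: Vec, q: Vec): Double =
    p.zip(q).map { case (a, b) => a * math.log(a / b) - a + b }.sum

  // The assignment step of k-means: pick the center minimizing the divergence.
  def assign(point: Vec, centers: Seq[Vec], d: (Vec, Vec) => Double): Int =
    centers.indices.minBy(i => d(point, centers(i)))
}
```

Swapping `d` between `squaredEuclidean` and `klDivergence` is the essence of the generalization: the same assign/update loop runs, only the distortion measure changes.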

Home Page: https://generalized-kmeans-clustering.massivedatascience.com/

License: Apache License 2.0

Languages: Scala 4.62% · HTML 93.40% · CSS 0.96% · JavaScript 1.03%
Topics: bregman-divergence, clustering, cosine-similarity, embeddings, entropy, euclidean-distance, itakura-saito-divergence, k-means, kullback-leibler-divergence, similarity-search, spark, spark-mllib

generalized-kmeans-clustering's People

Contributors

derrickburns · gitter-badger · jpreiss · waffle-iron


generalized-kmeans-clustering's Issues

Publish on Spark Packages Main Repository

Hi @derrickburns,

Would you like to release this package on the Spark Packages Maven repo? There is an sbt plugin called sbt-spark-package that helps you make the release straight from your sbt console. All you need to do is set a couple of configurations.

Publishing on the Spark Packages repo will bump your ranking on the website and will fill in the How To section, which users can follow to include your package in their work.
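A minimal build.sbt sketch of the "couple of configurations" mentioned above, assuming the sbt-spark-package plugin is on the build; the Spark version and component values here are illustrative assumptions and should be checked against the plugin's documentation:

```scala
// Hypothetical build.sbt fragment for sbt-spark-package (values are assumptions).
spName := "derrickburns/generalized-kmeans-clustering"  // "org/repo" name on Spark Packages

sparkVersion := "1.4.0"        // the Spark version the library is built against (assumed)

sparkComponents += "mllib"     // declares a dependency on spark-mllib
```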

Please let me know if you have any comments/questions/suggestions!

Best,
Burak

StackOverflowError when predicting single vectors

Hi,
When predicting a single Vector from an RDD[Vector] on a trained model, a StackOverflowError is thrown.
Doing the same on the whole RDD[Vector] at once works fine.

println("clustering single vectors fails")
val singleVector = mymatrix.map { point =>
  try {
    val prediction = kModel.predict(point)
    (point.toString, prediction)
  } catch {
    case e: StackOverflowError =>
      println("unable to predict a single vector")
      (point.toString, -1) // keep both branches the same type
  }
}
println(s"singleVector.count(): ${singleVector.count()}")

println("clustering using multiple vectors; this runs fine")
val predictions = kModel.predict(mymatrix)
val multipleVector = predictions.zip(mymatrix).map { case (prediction, point) =>
  (point.toString, prediction)
}
println(s"multipleVector.count(): ${multipleVector.count()}")

I've put my code with data as an example here: https://github.com/bkersbergen/massive-kmeans-overflow.

2015/06/18 11:10:03:300 [ERROR] [Executor task launch worker-5] org.apache.spark.Logging$class.logError:96 - Exception in task 0.0 in stage 63.0 (TID 31500)
java.lang.StackOverflowError
    at com.massivedatascience.divergence.SquaredEuclideanDistanceDivergence$.convexHomogeneous(BregmanDivergence.scala:144)
    at com.massivedatascience.clusterer.NonSmoothedPointCenterFactory$class.toPoint(BregmanPointOps.scala:209)
    at com.massivedatascience.clusterer.SquaredEuclideanPointOps$.toPoint(BregmanPointOps.scala:260)
    at com.massivedatascience.clusterer.KMeansPredictor$class.predictWeighted(KMeansModel.scala:66)
    at com.massivedatascience.clusterer.KMeansModel.predictWeighted(KMeansModel.scala:99)

This works with the MLlib k-means implementation; switching to massive-kmeans produces the StackOverflowError above.
(You can switch between the MLlib and massivedatascience import statements in the Scala file to see the difference.)

Rationale behind WeightedVector

I don't get the rationale behind WeightedVector. As far as I can tell, WeightedVector applies the same weight to every element of the Vector. For example, if I have

val v  = Vectors.dense(1,0.5,3)
val wv = WeightedVector(v,0.5)

wv will be treated as Vectors.dense(0.5, 0.25, 1.5) for clustering purposes, right?
Now, say I'm extracting two features from my data: one feature is represented by a single vector element and the other by 20 vector elements. If I want both features to carry the same weight in clustering, I should weight the first element by 1 and each of the other 20 by 1/20, right?
That is the kind of functionality I expected from WeightedVector. I don't see the point of WeightedVector as it is now, but that is probably due to my lack of experience with clustering and data mining in general.
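The per-feature-group weighting asked about above can be done manually before clustering, independently of WeightedVector's scalar weight. A plain-Scala sketch (names are illustrative, not part of the library): under squared Euclidean distance, scaling element i by sqrt(w_i) weights its squared contribution by w_i, so giving each element of a group the weight 1/groupSize makes the groups contribute equally in total.

```scala
object FeatureWeighting {
  // Scale each element by the square root of its own weight, so that the
  // squared-Euclidean contribution of element i is multiplied by weights(i).
  def scaleElements(v: Array[Double], weights: Array[Double]): Array[Double] = {
    require(v.length == weights.length, "one weight per element")
    v.zip(weights).map { case (x, w) => x * math.sqrt(w) }
  }

  // Equal total weight for two feature groups of sizes n1 and n2:
  // each element of a group gets weight 1 / groupSize.
  def groupWeights(n1: Int, n2: Int): Array[Double] =
    Array.fill(n1)(1.0 / n1) ++ Array.fill(n2)(1.0 / n2)
}
```

For the example above (one 1-element feature and one 20-element feature), `groupWeights(1, 20)` yields weight 1 for the first element and 1/20 for each of the other 20, which is exactly the behavior the question asks for.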
