saurfang / spark-tsne

Distributed t-SNE via Apache Spark

Home Page: https://saurfang.github.io/spark-tsne-demo/tsne-pixi.html

License: Apache License 2.0

Languages: Scala 83.25%, R 3.69%, HTML 13.07%
Topics: spark, tsne

spark-tsne's Introduction

spark-tsne

Join the chat at https://gitter.im/saurfang/spark-tsne. Distributed t-SNE with Apache Spark. Work in progress.

t-SNE is a dimension reduction technique that is particularly good for visualizing high-dimensional data. This is an attempt to implement the algorithm on Spark to leverage distributed computing power.

The project is still replicating the reference implementations from the original papers. Spark-specific optimizations are the next goal once correctness is verified.

Currently I'm showcasing this with the standard MNIST handwriting recognition dataset. I have created a WebGL player (built with pixi.js) to visualize the inner workings as well as the final results of t-SNE. If WebGL is unavailable to you, you can check out the d3.js player instead.
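Since the README does not yet document the API, here is a minimal usage sketch. It assumes that SimpleTSNE.tsne accepts an mllib RowMatrix, that the parameter names are noDims, maxIterations, and perplexity, and that it emits (iteration, embedding, loss) tuples through an RxScala Observable, as the MNIST example suggests; treat the import path and all of these names as assumptions and check the sources.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import com.github.saurfang.spark.tsne.impl.SimpleTSNE

// Hypothetical usage sketch; the import path, method signature, and tuple
// shape are assumptions based on the MNIST example, not documented API.
def runTSNE(sc: SparkContext): Unit = {
  val data = sc.textFile("data/mnist.csv")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  val matrix = new RowMatrix(data)

  SimpleTSNE.tsne(matrix, noDims = 2, maxIterations = 200, perplexity = 20)
    .subscribe { case (iteration, embedding, loss) =>
      // each emitted item corresponds to one gradient-descent iteration
      println(s"iteration $iteration, loss $loss")
    }
}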

Credits

spark-tsne's People

Contributors

erwinvaneijk, gitter-badger, saurfang, zhensongqian


spark-tsne's Issues

Building Spark-TSNE Behind Corporate Firewall

Hi,

I am unable to download plugins etc. from behind the firewall, but I am very interested in looking at this code. Is it possible to provide an assembled jar? I am unable to convince our IT department here to open up access to all the sites the build process uses.

Thank you for your help!

t-SNE package does not seem to work with Spark 2.1

Hi,

Looks like the t-SNE package does not work with Spark 2.1. After importing the com.github.saurfang.* package, even a simple computation of column summary statistics fails:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)  // a dense vector containing the mean value for each column
println(summary.variance)  // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column

and that fails with:

Name: Compile Error
Message: <console>:50: error: type mismatch;
 found   : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
 required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
                                                                         ^
StackTrace: 

Any thoughts on resolving this?

Thanks,
Rajesh

Getting the t-SNE result?

Hi,

I'd like to try your implementation of t-SNE with Spark. In the example, you show how to write a CSV file at every iteration. How can we get just the final iteration's result without writing it to a file? Is it possible to avoid RxScala with this algorithm?

Best,

Jao
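One possible answer, sketched under the assumption (per the MNIST example) that SimpleTSNE.tsne returns an RxScala Observable of (iteration, embedding, loss) tuples: block on the Observable and keep only its last emission, so nothing is written to disk. RxScala itself cannot be avoided if the public API returns an Observable, but it can be confined to a single line.

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import com.github.saurfang.spark.tsne.impl.SimpleTSNE

// Hypothetical sketch, not documented API: toBlocking.last waits for the
// Observable to complete and returns the final emitted item in memory.
def finalEmbedding(matrix: RowMatrix) = {
  val (lastIteration, embedding, loss) =
    SimpleTSNE.tsne(matrix, maxIterations = 200).toBlocking.last
  embedding // final low-dimensional coordinates, never written to a file
}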

t-SNE: is this a good use case for t-SNE to compress 1M features down to a few thousand features?

This project was referenced in my Data Science question at https://datascience.stackexchange.com/a/35561/5380

It seems t-SNE is more suited to data visualization tasks than to our use case, where we still want to keep as much variance as possible for modeling (as noted above, we're planning to bring the number of features down to several thousand).

t-SNE's focus is on reducing down to two dimensions so the data can be visualized? See lvdmaaten.github.io/tsne:

Can I use t-SNE to embed data in more than two dimensions?

Well, yes you can, but there is a catch. The key characteristic of t-SNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic data dimensionality and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the t-distribution when embedding into thirty dimensions (or if you try to embed intrinsically very low-dimensional data such as the Swiss roll). More details about this are described in the AI-STATS paper.
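For reference, the kernel the FAQ alludes to: t-SNE models low-dimensional similarities with a Student-t distribution, and the degrees of freedom \alpha can be raised when embedding into more than two dimensions (standard t-SNE fixes \alpha = 1):

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2 / \alpha\right)^{-(\alpha+1)/2}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2 / \alpha\right)^{-(\alpha+1)/2}}, \qquad \alpha = 1 \;\Rightarrow\; q_{ij} \propto \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}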

Would it be a good use case for t-SNE to create a low-rank version of a 1M-feature x 60M-row matrix, down to a few thousand features? Would saurfang/spark-tsne scale to such a dataset?

Thanks!

Install instructions?

Thank you for working on this project.
However, I am a bit clueless and could not figure out how to install it (or get it installed by the administrators). The readme doesn't mention the installation process, and I couldn't find any other install file.
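For what it's worth: the project doesn't document a published artifact, so the grounded route is to build from source with sbt assembly and put the resulting jar on the Spark classpath, as shown in the last issue on this page. If an artifact is published to Spark Packages, a build.sbt entry might look like the following sketch; the resolver URL, coordinates, and version are all assumptions, so verify them before use.

// Hypothetical build.sbt sketch; resolver URL, coordinates, and version are
// assumptions, not taken from the README. Verify against Spark Packages.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
libraryDependencies += "saurfang" % "spark-tsne" % "0.1.0"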

some bugs about persist?

The RDD norm is persisted at X2P.scala:21; I think this persist is unnecessary since norm is only used once in the code that follows.
P is cached at SimpleTSNE.scala:38; it should be unpersisted after the last action on P.
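A minimal sketch of the pattern being suggested (the names here are illustrative, not the actual SimpleTSNE internals): persist an RDD that is reused across iterations, then unpersist it once its last action has run.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: cache an RDD reused across iterations, then free it.
def gradientLoop(sc: SparkContext): Unit = {
  val p = sc.parallelize(1 to 1000)
    .map(i => i.toDouble)
    .persist(StorageLevel.MEMORY_AND_DISK) // reused every iteration, so caching pays off

  var loss = 0.0
  for (iter <- 1 to 10) {
    loss = p.map(v => v * v).sum() // each action reuses the cached blocks
  }

  p.unpersist() // the last action has run; release the cached blocks
  println(s"final loss: $loss")
}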

com.github.saurfang#sbt-spark-submit;0.0.4: not found

I tried to run the MNIST example in IntelliJ, but I got this error in the build phase.

Error: Error while importing SBT project:
...
[error] 	at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:149)
[error] 	at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:118)
[error] 	at sbt.Classpaths$.$anonfun$updateTask$5(Defaults.scala:2353)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] 	at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] 	at sbt.Execute.work(Execute.scala:266)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:748)
[error] (*:update) sbt.librarymanagement.ResolveException: unresolved dependency: com.github.saurfang#sbt-spark-submit;0.0.4: not found
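A possible workaround sketch: point sbt at an extra plugin resolver in project/plugins.sbt. The plugin coordinates below come from the error message above, but the resolver URL is an assumption; the correct repository for this plugin needs to be confirmed.

// Hypothetical project/plugins.sbt sketch; the resolver URL is an assumption,
// while the plugin coordinates are taken from the error message.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
addSbtPlugin("com.github.saurfang" % "sbt-spark-submit" % "0.0.4")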

Does it handle large data?

Hello

I currently use sklearn's TSNE, and it is not very memory-friendly. I wonder how this project compares to it in terms of the number of rows it can handle. Thanks.

Problem running **sbt sparkMNIST**

Hi,
I ran your algorithm (via sbt sparkMNIST) on a file of 900 dense vectors, but I got the following error:


...
15/09/28 17:27:13 INFO JniLoader: already loaded netlib-native_system-linux-x86_64.so
15/09/28 17:27:51 INFO X2P: Mean value of sigma: 3.595693066031822              

[Stage 26:========>                                            (569 + 3) / 3600]
15/09/28 17:28:44 ERROR Utils: Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
    at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
    ... 32 more

This shouldn't be a problem with your code but rather a bug in Spark itself (see the linked issue).
I fixed it by running the following commands, as suggested at that link:

sbt assembly
./bin/spark-submit --class com.github.saurfang.spark.tsne.examples.MNIST --master local[3] spark-tsne-examples-assembly-0.1-SNAPSHOT.jar

I hope this message is helpful for other users.
