saurfang / spark-tsne

Distributed t-SNE via Apache Spark

Home Page: https://saurfang.github.io/spark-tsne-demo/tsne-pixi.html

License: Apache License 2.0

Languages: Scala 83.25%, R 3.69%, HTML 13.07%
Topics: spark, tsne

spark-tsne's Introduction

spark-tsne

Join the chat at https://gitter.im/saurfang/spark-tsne. Distributed t-SNE with Apache Spark. Work in progress.

t-SNE is a dimension reduction technique that is particularly good for visualizing high-dimensional data. This is an attempt to implement the algorithm on Spark to leverage distributed computing power.

The project is still replicating the reference implementations from the original papers. Spark-specific optimizations are the next goal once correctness is verified.

Currently I'm showcasing this with the standard MNIST handwriting recognition dataset. I have created a WebGL player (built with pixi.js) to visualize the inner workings as well as the final results of t-SNE. If WebGL is unavailable to you, you can check out the d3.js player instead.
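Since the README does not yet document the API, here is a minimal usage sketch. It assumes that SimpleTSNE.tsne accepts an mllib RowMatrix, that the parameter names are noDims, maxIterations, and perplexity, and that it emits (iteration, embedding, loss) tuples through an RxScala Observable, as the MNIST example suggests; treat the import path and all of these names as assumptions and check the sources.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import com.github.saurfang.spark.tsne.impl.SimpleTSNE

// Hypothetical usage sketch; the import path, method signature, and tuple
// shape are assumptions based on the MNIST example, not documented API.
def runTSNE(sc: SparkContext): Unit = {
  val data = sc.textFile("data/mnist.csv")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  val matrix = new RowMatrix(data)

  SimpleTSNE.tsne(matrix, noDims = 2, maxIterations = 200, perplexity = 20)
    .subscribe { case (iteration, embedding, loss) =>
      // each emitted item corresponds to one gradient-descent iteration
      println(s"iteration $iteration, loss $loss")
    }
}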

Credits

spark-tsne's People

Contributors

erwinvaneijk, gitter-badger, saurfang, zhensongqian


spark-tsne's Issues

Building Spark-TSNE Behind Corporate Firewall

Hi,

I am unable to download plugins etc. from behind the firewall, but I am very interested in looking at this code. Is it possible to provide an assembled jar? I am unable to convince our IT department here to open up access to all the sites the build process uses.

Thank you for your help!

t-SNE package does not seem to work with Spark 2.1

Hi,

Looks like the t-SNE package does not work with Spark 2.1. After importing the com.github.saurfang.* package, even a simple computation of column summary statistics fails:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)  // a dense vector containing the mean value for each column
println(summary.variance)  // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column

and that fails with:

Name: Compile Error
Message: <console>:50: error: type mismatch;
 found   : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
 required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
                                                                         ^
StackTrace: 

Any thoughts on resolving this?

Thanks,
Rajesh

Getting the t-SNE result?

Hi,

I'd like to try your implementation of t-SNE with Spark. In the example, you show how to write a CSV file at every iteration. How can we get just the final iteration's result without writing it to a file? Is it possible to avoid RxScala with this algorithm?

Best,

Jao
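One possible answer, sketched under the assumption (per the MNIST example) that SimpleTSNE.tsne returns an RxScala Observable of (iteration, embedding, loss) tuples: block on the Observable and keep only its last emission, so nothing is written to disk. RxScala itself cannot be avoided if the public API returns an Observable, but it can be confined to a single line.

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import com.github.saurfang.spark.tsne.impl.SimpleTSNE

// Hypothetical sketch, not documented API: toBlocking.last waits for the
// Observable to complete and returns the final emitted item in memory.
def finalEmbedding(matrix: RowMatrix) = {
  val (lastIteration, embedding, loss) =
    SimpleTSNE.tsne(matrix, maxIterations = 200).toBlocking.last
  embedding // final low-dimensional coordinates, never written to a file
}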

t-SNE: is this a good use case for t-SNE to compress 1M features down to a few thousand features?

This project was referenced in my Data Science question at https://datascience.stackexchange.com/a/35561/5380

It seems t-SNE is more suited to data visualization tasks than to our use case, where we still want to keep as much variance as possible for modeling (as noted above, we're planning to bring the number of features down to several thousand).

t-SNE's focus is on reducing down to two dimensions so the data can be visualized? See lvdmaaten.github.io/tsne:

Can I use t-SNE to embed data in more than two dimensions?

Well, yes you can, but there is a catch. The key characteristic of t-SNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic data dimensionality and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the t-distribution when embedding into thirty dimensions (or if you try to embed intrinsically very low-dimensional data such as the Swiss roll). More details about this are described in the AI-STATS paper.
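For reference, the kernel the FAQ alludes to: t-SNE models low-dimensional similarities with a Student-t distribution, and the degrees of freedom \alpha can be raised when embedding into more than two dimensions (standard t-SNE fixes \alpha = 1):

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2 / \alpha\right)^{-(\alpha+1)/2}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2 / \alpha\right)^{-(\alpha+1)/2}}, \qquad \alpha = 1 \;\Rightarrow\; q_{ij} \propto \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}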

Would it be a good use case for t-SNE to create a low-rank version of a 1M-feature x 60M-row matrix, down to a few thousand features? Would saurfang/spark-tsne scale to such a dataset?

Thanks!

Install instructions?

Thank you for working on this project.
However, I am a bit clueless and could not figure out how to install it (or get it installed by the administrators). The readme doesn't mention the installation process, and I couldn't find any other install file.
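For what it's worth: the project doesn't document a published artifact, so the grounded route is to build from source with sbt assembly and put the resulting jar on the Spark classpath, as shown in the last issue on this page. If an artifact is published to Spark Packages, a build.sbt entry might look like the following sketch; the resolver URL, coordinates, and version are all assumptions, so verify them before use.

// Hypothetical build.sbt sketch; resolver URL, coordinates, and version are
// assumptions, not taken from the README. Verify against Spark Packages.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
libraryDependencies += "saurfang" % "spark-tsne" % "0.1.0"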

some bugs about persist?

The RDD norm is persisted at X2P.scala:21; I think this persist is unnecessary since norm is only used once in the code that follows.
P is cached at SimpleTSNE.scala:38; it should be unpersisted after the last action on P.
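A minimal sketch of the pattern being suggested (the names here are illustrative, not the actual SimpleTSNE internals): persist an RDD that is reused across iterations, then unpersist it once its last action has run.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: cache an RDD reused across iterations, then free it.
def gradientLoop(sc: SparkContext): Unit = {
  val p = sc.parallelize(1 to 1000)
    .map(i => i.toDouble)
    .persist(StorageLevel.MEMORY_AND_DISK) // reused every iteration, so caching pays off

  var loss = 0.0
  for (iter <- 1 to 10) {
    loss = p.map(v => v * v).sum() // each action reuses the cached blocks
  }

  p.unpersist() // the last action has run; release the cached blocks
  println(s"final loss: $loss")
}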

com.github.saurfang#sbt-spark-submit;0.0.4: not found

I tried to run the MNIST example in IntelliJ, but I got this error in the build phase.

Error: Error while importing SBT project:
...
[error] 	at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:149)
[error] 	at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:118)
[error] 	at sbt.Classpaths$.$anonfun$updateTask$5(Defaults.scala:2353)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] 	at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] 	at sbt.Execute.work(Execute.scala:266)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:748)
[error] (*:update) sbt.librarymanagement.ResolveException: unresolved dependency: com.github.saurfang#sbt-spark-submit;0.0.4: not found
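A possible workaround sketch: point sbt at an extra plugin resolver in project/plugins.sbt. The plugin coordinates below come from the error message above, but the resolver URL is an assumption; the correct repository for this plugin needs to be confirmed.

// Hypothetical project/plugins.sbt sketch; the resolver URL is an assumption,
// while the plugin coordinates are taken from the error message.
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
addSbtPlugin("com.github.saurfang" % "sbt-spark-submit" % "0.0.4")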

Does it handle large data?

Hello

I currently use sklearn's TSNE, and it is not very memory-friendly. I wonder how this project compares to it in terms of the number of rows it can handle. Thanks.

Problem running **sbt sparkMNIST**

Hi,
I ran your algorithm (via sbt sparkMNIST) on a file of 900 dense vectors, but I got the following error:


...
15/09/28 17:27:13 INFO JniLoader: already loaded netlib-native_system-linux-x86_64.so
15/09/28 17:27:51 INFO X2P: Mean value of sigma: 3.595693066031822              

[Stage 26:========>                                            (569 + 3) / 3600]
15/09/28 17:28:44 ERROR Utils: Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
    at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1897)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)
    ... 32 more

This shouldn't be a problem with your code but rather a bug in Spark itself (see the linked issue).
I fixed it by running the following commands, as suggested at that link:

sbt assembly
./bin/spark-submit --class com.github.saurfang.spark.tsne.examples.MNIST --master local[3] spark-tsne-examples-assembly-0.1-SNAPSHOT.jar

I hope this message is helpful for other users.
