karpathy / tsnejs

Implementation of t-SNE visualization algorithm in Javascript.

tsnejs's Introduction

tSNEJS

tSNEJS is an implementation of the t-SNE visualization algorithm in JavaScript.

t-SNE is a visualization algorithm that embeds things in 2 or 3 dimensions. If you have some data and you can measure their pairwise differences, t-SNE visualization can help you identify clusters in your data. See example below.

Online demo

The main project website has a live example and more description.

There is also the t-SNE CSV demo that allows you to simply paste CSV data into a textbox and tSNEJS computes and visualizes the embedding on the fly (no coding needed).

Research Paper

The algorithm was originally described in this paper:

L.J.P. van der Maaten and G.E. Hinton.
Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research
9(Nov):2579-2605, 2008.

You can find the PDF here.

Example code

Import tsne.js into your document: <script src="tsne.js"></script>

And then here is some example code:

var opt = {};
opt.epsilon = 10; // epsilon is learning rate (10 = default)
opt.perplexity = 30; // roughly how many neighbors each point influences (30 = default)
opt.dim = 2; // dimensionality of the embedding (2 = default)

var tsne = new tsnejs.tSNE(opt); // create a tSNE instance

// initialize data. Here we have 3 points and some example pairwise dissimilarities
var dists = [[1.0, 0.1, 0.2], [0.1, 1.0, 0.3], [0.2, 0.1, 1.0]];
tsne.initDataDist(dists);

for(var k = 0; k < 500; k++) {
  tsne.step(); // every time you call this, solution gets better
}

var Y = tsne.getSolution(); // Y is an array of 2-D points that you can plot

The data can be passed to tSNEJS as a set of high-dimensional points using the tsne.initDataRaw(X) function, where X is an array of arrays (high-dimensional points that need to be embedded). The algorithm computes the Gaussian kernel over these points and then finds the appropriate embedding.
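As a rough illustration of the two input forms, the sketch below builds a pairwise distance matrix from raw points. The helpers `euclidean` and `pairwiseDistances` are hypothetical and not part of tSNEJS, and plain Euclidean distance is an assumption; any dissimilarity measure could be substituted. The resulting matrix is symmetric with zeros on the diagonal, as a true distance matrix would be.

```javascript
// Hypothetical helper (not part of tSNEJS): Euclidean distance
// between two vectors of equal length.
function euclidean(a, b) {
  var sum = 0;
  for (var k = 0; k < a.length; k++) {
    var diff = a[k] - b[k];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Hypothetical helper: build an N x N matrix of pairwise distances.
// The diagonal stays 0 and the matrix is symmetric by construction.
function pairwiseDistances(X) {
  var N = X.length;
  var D = [];
  for (var r = 0; r < N; r++) {
    D.push(new Array(N).fill(0));
  }
  for (var i = 0; i < N; i++) {
    for (var j = i + 1; j < N; j++) {
      var d = euclidean(X[i], X[j]);
      D[i][j] = d;
      D[j][i] = d;
    }
  }
  return D;
}

// Three made-up 2-D points, chosen so the distances are round numbers.
var X = [[0, 0], [3, 4], [6, 8]];
var D = pairwiseDistances(X);
console.log(D); // [[0, 5, 10], [5, 0, 5], [10, 5, 0]]

// This part only runs where tsne.js has been loaded and defines `tsnejs`.
if (typeof tsnejs !== "undefined") {
  var model = new tsnejs.tSNE({ dim: 2 });
  model.initDataDist(D); // or pass the raw points: model.initDataRaw(X)
  for (var s = 0; s < 500; s++) {
    model.step();
  }
  console.log(model.getSolution());
}
```

With only three points the embedding itself is not meaningful; the example is meant to show the shape of the inputs.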

Web Demos

There are two web interfaces to this library that we are aware of:

By Andrej, here.
By Laurens, here, which takes data in different format and can also use Google Spreadsheet input.

About

Send questions to @karpathy.

License

MIT

tsnejs's Issues

Remove max cells condition

Hi,

How can I remove the max cell condition to run the t-sne code on a larger set of data ?

Thanks in advance for your help.

Node Package

Hello,
thank you for creating and maintaining this library!
I have a couple of questions, if possible.
Is there a Node package available for installing the library?
If not, will there ever be one?

Thank you for your time

Early exaggeration

Super late but I think there is a small bug with how early exaggeration is implemented. The code has:

var premult = 4 * (pmul * P[i*N+j] - Q[i*N+j]) * Qu[i*N+j];

whereas I think it should be

var premult = 4 * pmul * (P[i*N+j] - Q[i*N+j]) * Qu[i*N+j];

I.e. the early exaggeration factor pmul should multiply the overall gradient rather than P alone.

The t-SNE authors describe early exaggeration as scaling P for the first few iterations. Since P is a constant, you'd think you could achieve the effect just by scaling P in the gradient. This is what the code does.

But at the same time, since the loss (-(P * Q.log()).sum()) is multiplied by P, scaling P should also scale the overall gradient. Now I'm confused.

In Appendix A of the original paper, the authors assume the sum of P is 1. But under early exaggeration the sum is pmul. In this case, the q_ij term in the gradient becomes pmul * q_ij. So really the overall gradient should be scaled by pmul.

Not sure if this "fix" should be implemented, since the code has worked reliably for years. But thought it was worth noting.
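To make the difference between the two expressions concrete, here is a small sketch that evaluates both for a single pair (i, j). The function names and the numeric values are made up for illustration and are not taken from tsne.js or from a real run. It only shows that the two variants can disagree even in sign, which happens whenever p < q < pmul * p; it does not settle which one is preferable.

```javascript
// Variant quoted from the code above: pmul scales P only.
function premultScaledP(p, q, qu, pmul) {
  return 4 * (pmul * p - q) * qu;
}

// Variant proposed above: pmul scales the whole (P - Q) difference.
function premultScaledGradient(p, q, qu, pmul) {
  return 4 * pmul * (p - q) * qu;
}

// Made-up values for one pair (i, j), chosen so that p < q < pmul * p.
var p = 0.02;
var q = 0.05;
var qu = 0.5;
var pmul = 4;

var a = premultScaledP(p, q, qu, pmul);        // positive, about 0.06
var b = premultScaledGradient(p, q, qu, pmul); // negative, about -0.24
console.log(a, b);
```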

Trouble with NaNs after step function updates

Some background: I'm trying to visualize my Spotify playlists but I'm having some trouble getting going here.

My data consists of 267 songs, each song has 10 features. Here's a sample (ignore the artist and title fields).

{
"artist":"Drake",
"audio_summary.acousticness":0.016128527,
"audio_summary.danceability":0.3236382,
"audio_summary.energy":0.8417243,
"audio_summary.key":7,
"audio_summary.liveness":0.13018084,
"audio_summary.loudness":-5.548,
"audio_summary.mode":1,
"audio_summary.speechiness":0.0,
"audio_summary.tempo":98.39,
"audio_summary.time_signature":5,
"title":"Over"
}

I'm passing the data as a 267 element array, each element is a 10 element array. I'm using initDataRaw to initialize but I've tried both init methods.

The problem is even after just one call to the step function, getSolution returns [NaN, NaN].

I had this problem originally, but switching from initDataDist to initDataRaw seemed to avoid the NaN. The visualization I got was off, though. I wish I had taken a picture, because I'm having trouble reproducing it due to the NaN issues, but essentially the songs spread out in a line along a diagonal, as if the data were being compressed to 1D.

I thought maybe the issue was that some fields have values much larger than others, tempo for example. So I normalized all the features, and then came the NaN problem. The weird thing is that even my old non-normalized data is giving me NaNs now!

Any ideas of what I'm doing wrong? Tips for getting the data setup in general (avoiding NaNs)?

Thanks!
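One thing that may be worth ruling out before calling initDataRaw is that the normalization step itself introduces NaN, for example by dividing by a zero standard deviation when a feature is constant across all songs. This is a guess, not a diagnosis of the problem above. The sketch below uses two hypothetical helpers, `standardizeColumns` and `findNonFinite`, which are not part of tSNEJS, and a few made-up rows.

```javascript
// Hypothetical helper: z-score each column of X (array of equal-length rows).
// A constant column has zero standard deviation; it is mapped to all zeros
// instead of being divided by zero.
function standardizeColumns(X) {
  var n = X.length;
  var dims = X[0].length;
  var out = X.map(function (row) { return row.slice(); });
  var i;
  for (var d = 0; d < dims; d++) {
    var mean = 0;
    for (i = 0; i < n; i++) mean += X[i][d];
    mean /= n;
    var variance = 0;
    for (i = 0; i < n; i++) {
      variance += (X[i][d] - mean) * (X[i][d] - mean);
    }
    var std = Math.sqrt(variance / n);
    for (i = 0; i < n; i++) {
      out[i][d] = std > 0 ? (X[i][d] - mean) / std : 0;
    }
  }
  return out;
}

// Hypothetical helper: return [row, column] positions of every entry
// that is not a finite number (NaN, Infinity, undefined, strings, ...).
function findNonFinite(X) {
  var bad = [];
  X.forEach(function (row, i) {
    row.forEach(function (v, d) {
      if (typeof v !== "number" || !isFinite(v)) bad.push([i, d]);
    });
  });
  return bad;
}

// Made-up rows with three features; the middle feature is constant.
var songs = [
  [0.016, 0.0, 98.39],
  [0.25, 0.0, 120.0],
  [0.9, 0.0, 140.5]
];
var Z = standardizeColumns(songs);
console.log(findNonFinite(Z).length); // 0
```

If `findNonFinite` reports any positions on the real data, those entries would need to be fixed or dropped before the array is handed to the library.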

README: I struggled to make the library work because I wrongly inferred from the example in README.md that the "distance" or "dissimilarity" matrix had to have 1.0 on the diagonal, and not 0.0 as a true distance matrix would. (Also, the "example" matrix isn't symmetric.)
I finally understood that it was really a pairwise distance matrix, and fed it geodesic distances to make http://bl.ocks.org/Fil/b07d09162377827f1b3e266c43de6d2a
Web Worker: tsne tries to attach itself to window, which prevents using it in a web worker (as in http://bl.ocks.org/Fil/e402e9c51ce77c21baedc2d1af933bc3 , which I made with https://github.com/scienceai/tsne-js ). This is probably a simple fix.
Learning: Is there any possibility to expose the model — and use the generated mapping to project points that were not given initially?
Online: is it possible to augment a trained model with new data?
Seeding: can we seed the model with initial positions?
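On the Web Worker point, one possible shape for the fix is to attach the library to whichever global object exists rather than to window specifically. The sketch below is an assumption about how that could look, not the library's actual export code; `getGlobalObject` and `exampleLib` are placeholder names.

```javascript
// Pick whichever global object is available. `self` exists in web workers
// and browsers, `window` only in browser pages, `global` in Node.js, and
// `globalThis` in any modern environment.
function getGlobalObject() {
  if (typeof globalThis !== "undefined") return globalThis;
  if (typeof self !== "undefined") return self;
  if (typeof window !== "undefined") return window;
  if (typeof global !== "undefined") return global;
  throw new Error("no global object found");
}

// `exampleLib` is a placeholder standing in for the library object.
var root = getGlobalObject();
root.exampleLib = { version: "sketch" };
console.log(typeof root.exampleLib); // "object"
```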