bhtsne's People

Contributors

apbard, arshak, by321, fdokic, florianuekermann, jackurb, leshaker, lvdmaaten, make, mrucker, rohit-gupta, schulzch, sg-s, varunagrawal, yangqing, zhmz90


bhtsne's Issues

Dynamic lying factor

@lvdmaaten Have you tried using a dynamic lying factor instead of the static 12.0?
I got a slightly lower error (it dropped from 1.428 to 1.394) by replacing the static lying factor with a value that decays over the exaggeration phase:

    // Exaggeration ("lying") factor decays linearly from 12.0 down to 1.0
    // over the first stop_lying_iter iterations. Declared before the
    // gradient-descent loop:
    double lying_factor = 12.0;
    double lying_decrease = 0.0;
    if(stop_lying_iter > 0)
        lying_decrease = (lying_factor - 1.0) / (double) stop_lying_iter;

    // Inside the loop: undo the current exaggeration, shrink the factor,
    // then re-apply it.
    if(iter < stop_lying_iter) {
        if(exact) { for(int i = 0; i < N * N; i++)        P[i] /= lying_factor; }
        else      { for(int i = 0; i < row_P[N]; i++) val_P[i] /= lying_factor; }
        lying_factor -= lying_decrease;
        if(exact) { for(int i = 0; i < N * N; i++)        P[i] *= lying_factor; }
        else      { for(int i = 0; i < row_P[N]; i++) val_P[i] *= lying_factor; }
    }

Output for bhtsne with the static lying factor on a 10000 x 200 input:

Learning embedding...
Iteration 50: error is 83.834001 (50 iterations in 6.29 seconds)
Iteration 100: error is 82.772901 (50 iterations in 5.75 seconds)
Iteration 150: error is 82.683113 (50 iterations in 5.59 seconds)
Iteration 200: error is 82.657414 (50 iterations in 6.91 seconds)
Iteration 250: error is 4.401062 (50 iterations in 5.98 seconds)
Iteration 300: error is 2.621082 (50 iterations in 6.51 seconds)
Iteration 350: error is 2.200671 (50 iterations in 5.28 seconds)
Iteration 400: error is 1.975511 (50 iterations in 5.58 seconds)
Iteration 450: error is 1.834608 (50 iterations in 6.70 seconds)
Iteration 500: error is 1.740066 (50 iterations in 5.66 seconds)
Iteration 550: error is 1.671593 (50 iterations in 7.15 seconds)
Iteration 600: error is 1.619357 (50 iterations in 7.49 seconds)
Iteration 650: error is 1.579096 (50 iterations in 8.11 seconds)
Iteration 700: error is 1.550052 (50 iterations in 6.58 seconds)
Iteration 750: error is 1.528960 (50 iterations in 6.04 seconds)
Iteration 800: error is 1.516330 (50 iterations in 7.10 seconds)
Iteration 850: error is 1.507626 (50 iterations in 5.70 seconds)
Iteration 900: error is 1.500774 (50 iterations in 7.26 seconds)
Iteration 950: error is 1.495621 (50 iterations in 6.24 seconds)
Iteration 1000: error is 1.490954 (50 iterations in 6.61 seconds)
Iteration 1050: error is 1.486083 (50 iterations in 5.91 seconds)
Iteration 1100: error is 1.480977 (50 iterations in 9.09 seconds)
Iteration 1150: error is 1.475732 (50 iterations in 12.08 seconds)
Iteration 1200: error is 1.471433 (50 iterations in 7.36 seconds)
Iteration 1250: error is 1.467418 (50 iterations in 5.78 seconds)
Iteration 1300: error is 1.463404 (50 iterations in 7.68 seconds)
Iteration 1350: error is 1.459785 (50 iterations in 5.97 seconds)
Iteration 1400: error is 1.456251 (50 iterations in 5.53 seconds)
Iteration 1450: error is 1.453166 (50 iterations in 6.60 seconds)
Iteration 1500: error is 1.450100 (50 iterations in 5.54 seconds)
Iteration 1550: error is 1.447593 (50 iterations in 7.94 seconds)
Iteration 1600: error is 1.445209 (50 iterations in 5.48 seconds)
Iteration 1650: error is 1.442866 (50 iterations in 5.72 seconds)
Iteration 1700: error is 1.440184 (50 iterations in 6.16 seconds)
Iteration 1750: error is 1.437438 (50 iterations in 5.08 seconds)
Iteration 1800: error is 1.435075 (50 iterations in 6.89 seconds)
Iteration 1850: error is 1.432948 (50 iterations in 5.19 seconds)
Iteration 1900: error is 1.431293 (50 iterations in 5.38 seconds)
Iteration 1950: error is 1.429766 (50 iterations in 6.69 seconds)
Iteration 1999: error is 1.428257 (50 iterations in 5.49 seconds)

Output for bhtsne with the dynamic lying factor on the same 10000 x 200 input:

Learning embedding...
Iteration 50: error is 65.368017 (50 iterations in 7.20 seconds)
Iteration 100: error is 46.293893 (50 iterations in 6.24 seconds)
Iteration 150: error is 29.246011 (50 iterations in 7.21 seconds)
Iteration 200: error is 13.983286 (50 iterations in 6.51 seconds)
Iteration 250: error is 2.634763 (50 iterations in 7.59 seconds)
Iteration 300: error is 2.010282 (50 iterations in 5.58 seconds)
Iteration 350: error is 1.809022 (50 iterations in 5.98 seconds)
Iteration 400: error is 1.698381 (50 iterations in 6.79 seconds)
Iteration 450: error is 1.626216 (50 iterations in 6.61 seconds)
Iteration 500: error is 1.575453 (50 iterations in 7.00 seconds)
Iteration 550: error is 1.539009 (50 iterations in 5.83 seconds)
Iteration 600: error is 1.511758 (50 iterations in 7.45 seconds)
Iteration 650: error is 1.493206 (50 iterations in 5.83 seconds)
Iteration 700: error is 1.480972 (50 iterations in 6.17 seconds)
Iteration 750: error is 1.473207 (50 iterations in 6.50 seconds)
Iteration 800: error is 1.467411 (50 iterations in 5.75 seconds)
Iteration 850: error is 1.462050 (50 iterations in 6.94 seconds)
Iteration 900: error is 1.457054 (50 iterations in 5.78 seconds)
Iteration 950: error is 1.452213 (50 iterations in 6.93 seconds)
Iteration 1000: error is 1.447649 (50 iterations in 5.43 seconds)
Iteration 1050: error is 1.443852 (50 iterations in 6.00 seconds)
Iteration 1100: error is 1.440698 (50 iterations in 6.88 seconds)
Iteration 1150: error is 1.437592 (50 iterations in 6.08 seconds)
Iteration 1200: error is 1.434110 (50 iterations in 7.13 seconds)
Iteration 1250: error is 1.430698 (50 iterations in 5.43 seconds)
Iteration 1300: error is 1.427438 (50 iterations in 6.43 seconds)
Iteration 1350: error is 1.424260 (50 iterations in 4.72 seconds)
Iteration 1400: error is 1.421095 (50 iterations in 10.57 seconds)
Iteration 1450: error is 1.418043 (50 iterations in 11.95 seconds)
Iteration 1500: error is 1.415134 (50 iterations in 5.66 seconds)
Iteration 1550: error is 1.412521 (50 iterations in 6.57 seconds)
Iteration 1600: error is 1.409757 (50 iterations in 6.42 seconds)
Iteration 1650: error is 1.407288 (50 iterations in 5.95 seconds)
Iteration 1700: error is 1.404948 (50 iterations in 6.25 seconds)
Iteration 1750: error is 1.403020 (50 iterations in 5.08 seconds)
Iteration 1800: error is 1.401276 (50 iterations in 7.00 seconds)
Iteration 1850: error is 1.399376 (50 iterations in 5.49 seconds)
Iteration 1900: error is 1.397579 (50 iterations in 5.88 seconds)
Iteration 1950: error is 1.395958 (50 iterations in 6.58 seconds)
Iteration 1999: error is 1.394586 (50 iterations in 6.41 seconds)

Any opinions on this experiment?

Usage of random generator(s) in the source

Hello Laurens:

I am trying to have full control of results in the sense of getting exactly the same output for a given random seed. Could you please confirm that I got it right: there are only two places where a random generator is used:

  1. When the user wants to initialize Y via Normal distribution.
  2. In vptree.h there is a single line ("Create an arbitrary point...") where a random generator is used.

Thanks,
Nik Tuzov, PhD

C API

Both tsne.h and tsne_main.cpp seem to be almost entirely C code. Would you be opposed to a pull request making them C compatible? This would make your code usable as both a C and a C++ library, allowing for easy implementation of language bindings without intermediate files.

I am interested in this for the C library bindings and for contributing Go bindings.

Wrong python script name in the MATLAB wrapper

The MATLAB wrapper fails for me after installation with:

'.../bh_tsne/bh_tsne' is not recognized as an internal or external command,
operable program or batch file.

This error is displayed in the stdout of the system call in fast_tsne.m on line 79:

tic, system(fullfile(tsne_path,'./bh_tsne')); toc

The reason seems to be that the Python script this call refers to does not contain an underscore in its name, so the very simple workaround is to change the above line to:

tic, system(fullfile(tsne_path,'./bhtsne')); toc

Segmentation fault

0x0000000000403662 in TSNE::computeGaussianPerplexity (this=0x612010, X=0x7ffff6fab010, N=10000, D=30, _row_P=0x7fffffffe3d0, _col_P=0x7fffffffe3d8,
_val_P=0x7fffffffe3e0, perplexity=50, K=150) at tsne.cpp:470
470 cur_P[m] = exp(-beta * distances[m + 1] * distances[m + 1]);

The reason is that sizeof(distances) = 24 while K = 150; how can this work?
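
A note on that diagnosis (my own observation, assuming distances is a std::vector<double>, which is what the vptree-based neighbor search uses): sizeof(distances) measures the vector object itself, typically 24 bytes (three pointers) on a 64-bit build, not the number of elements it holds, so it says nothing about whether the vector actually contains K entries. A minimal illustration:

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> distances(150);
        // sizeof sees only the vector header, regardless of element count.
        std::printf("sizeof(distances) = %zu, distances.size() = %zu\n",
                    sizeof(distances), distances.size());
        return 0;
    }

distances.size() is the quantity to compare against K; if it really is smaller than K, the out-of-bounds read in the quoted line would explain the crash.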

Sparse Input Data

Is there a way to input sparse data? I suspect this is not straightforward to do, because there is no standard way to store sparse matrices in a text file; i.e., Python probably does it differently than MATLAB (I did not check, though).


OT: I just watched a video of you presenting t-SNE at Google and I want to compliment you on your explanation skills. Very clear and understandable.

Fortran porting/implementation: suggestions

Laurens (and anyone else)

I've got a stupid question: we're stuck with some very complex Fortran code for neural recording analysis. We can't realistically rewrite the Fortran suite from scratch, but we would love to try t-SNE somehow.

What do you recommend? I'm willing to run t-SNE through Fortran system calls, but that's quite a hack. Do you know of anyone implementing your work in Fortran?

Thanks for your time.
catubc

Can't compile file on Windows

After following the steps in the readme, the command window tells me:

sptree.cpp(111): error C3861: 'fmax': identifier not found
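
A possible workaround (an untested sketch on my part, assuming the error comes from an MSVC toolchain that predates C++11's fmax in <cmath>): provide a fallback definition near the top of sptree.cpp, after the existing includes.

    // Hypothetical fallback for MSVC versions without C99/C++11 fmax
    // (Visual Studio 2013, _MSC_VER 1800, was the first to ship it).
    #if defined(_MSC_VER) && _MSC_VER < 1800
    static inline double fmax(double a, double b) { return (a > b) ? a : b; }
    #endif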

Error using pca

Hello!

I downloaded this package and installed it on OS X and Windows.

But when I run the usage demonstration in MATLAB (https://github.com/lvdmaaten/bhtsne/), I get the error message below.

[error message screenshot]

Could you please provide a new version or a method to avoid this problem?

Thank you very much for sharing.

Document the "gains"

The computations involving the "gains" in tsne.cpp, line 72, carry the awe-inspiring comment

// Allocate some memory

This is not just "some memory". These values are part of a computation that is critical for the implementation to work properly. Neither the paper nor any of the copycat implementations say anything about what these "gains" are. Maybe it is obvious to those who are more deeply involved, but at some point it should be explained what the "gains" actually do.
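
For readers hitting the same question, here is my reading of the update loop in question, offered as a commented sketch rather than authoritative documentation (the constants 0.2, 0.8, and 0.01 are as I recall them from tsne.cpp, and sign() is a small helper defined in the same file). The gains act as per-coordinate adaptive learning rates in the spirit of Jacobs' delta-bar-delta rule: a coordinate whose gradient keeps pointing the same way gets a slowly growing step size, while a coordinate whose gradient flips sign gets its step size cut.

    // dY: current gradient; uY: running update (momentum term);
    // gains: per-coordinate learning-rate multipliers, initialized to 1.0.
    for(int i = 0; i < N * no_dims; i++) {
        // Consistent direction -> grow the gain additively;
        // sign flip            -> shrink it multiplicatively.
        gains[i] = (sign(dY[i]) != sign(uY[i])) ? (gains[i] + .2) : (gains[i] * .8);
        if(gains[i] < .01) gains[i] = .01;   // floor keeps coordinates from freezing
    }
    // Each gain scales the global learning rate eta for its coordinate.
    for(int i = 0; i < N * no_dims; i++) uY[i] = momentum * uY[i] - eta * gains[i] * dY[i];
    for(int i = 0; i < N * no_dims; i++)  Y[i] = Y[i] + uY[i];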

Segmentation fault on running bh_tsne

Hi, I think I'm seeing the same thing as this issue.

I compiled as in the instructions on Kali Linux using g++. It seemed to compile fine.

When I run it on some test data, I get this error:

Read the 4329 x 10 data matrix successfully!
Using current time as random seed...
Using no_dims = 2, perplexity = 60.000000, and theta = 0.500000
Computing input similarities...
Building tree...
 - point 0 of 4329
Segmentation fault

I have attached the binary that caused the problem, together with the data.dat file. I have also verified that another binary, compiled on macOS, works fine with this particular data.dat.

seg_fault.zip

This seems to be an issue on many different OSes, including Windows.

(paging @he-zhe)

App crash when running fast_tsne

I attempted to run fast_tsne from MATLAB using the wrapper. After some time of processing, the application crashes.

I'm on Win7 SP1 with MATLAB 2015b, using the fast_tsne code from the master branch.

Other observations:
The data.dat is being written; the crash occurs while the binary itself is running. In the first seconds of operation the bh_tsne process allocates a lot of memory (up to 4.4 GB in my case). This is within limits, and even just before the moment of the crash there is still 12% of physical memory free.

Performance difference Windows/Ubuntu

Hi all,

I've experienced weird performance differences when compiling the binary on my Windows 10 home desktop vs. in an Ubuntu 18.04 virtual machine. I compiled the binary using the instructions given in this repository, that is,

g++ sptree.cpp tsne.cpp tsne_main.cpp -o bh_tsne -O2

on Ubuntu, and

nmake -f Makefile.win all

on Windows (using Visual Studio 2019).

Still, using all 70000 MNIST digits, the Windows .exe runs in only half the time the binary requires on Ubuntu; see the following logs:

Windows:

Computing input similarities...
Building tree...

  • point 0 of 70000
  • point 10000 of 70000
  • point 20000 of 70000
  • point 30000 of 70000
  • point 40000 of 70000
  • point 50000 of 70000
  • point 60000 of 70000
    Input similarities computed in 734.95 seconds (sparsity = 0.002964)!
    Learning embedding...
    Iteration 1: error is 114.707556
    Iteration 50: error is 114.707556 (50 iterations in 52.88 seconds)
    Iteration 100: error is 114.707555 (50 iterations in 66.19 seconds)
    Iteration 150: error is 114.706666 (50 iterations in 63.20 seconds)
    Iteration 200: error is 108.965702 (50 iterations in 66.21 seconds)
    Iteration 250: error is 5.916399 (50 iterations in 54.78 seconds)
    Iteration 300: error is 4.703993 (50 iterations in 64.33 seconds)
    Iteration 350: error is 4.304277 (50 iterations in 66.77 seconds)
    Iteration 400: error is 4.067927 (50 iterations in 51.41 seconds)
    Iteration 450: error is 3.899897 (50 iterations in 51.27 seconds)
    Iteration 500: error is 3.772200 (50 iterations in 51.67 seconds)
    Iteration 550: error is 3.669212 (50 iterations in 51.34 seconds)
    Iteration 600: error is 3.585067 (50 iterations in 52.51 seconds)
    Iteration 650: error is 3.513970 (50 iterations in 50.97 seconds)
    Iteration 700: error is 3.452446 (50 iterations in 51.37 seconds)
    Iteration 750: error is 3.398439 (50 iterations in 52.12 seconds)
    Iteration 800: error is 3.350182 (50 iterations in 51.71 seconds)
    Iteration 850: error is 3.306805 (50 iterations in 51.35 seconds)
    Iteration 900: error is 3.267429 (50 iterations in 51.92 seconds)
    Iteration 950: error is 3.231636 (50 iterations in 52.13 seconds)
    Iteration 1000: error is 3.199034 (50 iterations in 51.68 seconds)
    Fitting performed in 1105.82 seconds.

Ubuntu:

Computing input similarities...
Building tree...

  • point 0 of 70000
  • point 10000 of 70000
  • point 20000 of 70000
  • point 30000 of 70000
  • point 40000 of 70000
  • point 50000 of 70000
  • point 60000 of 70000
    Input similarities computed in 735.45 seconds (sparsity = 0.002964)!
    Learning embedding...
    Iteration 1: error is 114.707556
    Iteration 50: error is 114.707556 (50 iterations in 114.57 seconds)
    Iteration 100: error is 114.707555 (50 iterations in 125.56 seconds)
    Iteration 150: error is 114.706492 (50 iterations in 116.89 seconds)
    Iteration 200: error is 109.278084 (50 iterations in 129.52 seconds)
    Iteration 250: error is 5.949873 (50 iterations in 140.74 seconds)
    Iteration 300: error is 4.721405 (50 iterations in 126.03 seconds)
    Iteration 350: error is 4.318675 (50 iterations in 120.15 seconds)
    Iteration 400: error is 4.080921 (50 iterations in 116.11 seconds)
    Iteration 450: error is 3.910903 (50 iterations in 114.96 seconds)
    Iteration 500: error is 3.780495 (50 iterations in 113.75 seconds)
    Iteration 550: error is 3.677030 (50 iterations in 118.45 seconds)
    Iteration 600: error is 3.591950 (50 iterations in 114.40 seconds)
    Iteration 650: error is 3.520119 (50 iterations in 112.86 seconds)
    Iteration 700: error is 3.458082 (50 iterations in 111.77 seconds)
    Iteration 750: error is 3.403688 (50 iterations in 112.70 seconds)
    Iteration 800: error is 3.355337 (50 iterations in 114.43 seconds)
    Iteration 850: error is 3.311879 (50 iterations in 111.70 seconds)
    Iteration 900: error is 3.272490 (50 iterations in 112.07 seconds)
    Iteration 950: error is 3.236595 (50 iterations in 114.01 seconds)
    Iteration 1000: error is 3.203558 (50 iterations in 113.41 seconds)
    Fitting performed in 2354.09 seconds.

TL;DR: while constructing the nearest-neighbor tree takes almost the same time on both machines, the gradient-descent iterations take twice as long on Ubuntu.

Any ideas on what could be going wrong would be greatly appreciated! Thanks
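
One possibility worth ruling out (my speculation, not a confirmed diagnosis): the Windows link line quoted elsewhere in this tracker passes /openmp and defines USEOMP, while the g++ command above enables neither, so the Ubuntu binary may be running single-threaded. Assuming the OpenMP pragmas are guarded by USEOMP as those flags suggest, a more comparable Linux build would be:

g++ sptree.cpp tsne.cpp tsne_main.cpp -o bh_tsne -O2 -fopenmp -DUSEOMP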

run_bh_tsne expects an input file not a numpy array

After #29 my code doesn't run anymore. I think it is because the first argument of run_bh_tsne now expects an input file instead of a numpy array; the function bh_tsne then builds a numpy array from this input file.

Is this intended behaviour? In my opinion it would be better to support both ways: starting bh_tsne with an input file and with a numpy array.

The error message:

File "/home/hans/wart-detection/bhtsne.py", line 206, in run_bh_tsne
    init_bh_tsne(input_file, tmp_dir_path, no_dims=no_dims, perplexity=perplexity, theta=theta, randseed=randseed,verbose=verbose, initial_dims=initial_dims, use_pca=use_pca, max_iter=max_iter)
  File "/home/hans/wart-detection/bhtsne.py", line 101, in init_bh_tsne
    for l in input_file), start=1):
  File "/home/hans/wart-detection/bhtsne.py", line 101, in <genexpr>
    for l in input_file), start=1):
AttributeError: 'numpy.ndarray' object has no attribute 'rstrip'
Exception AttributeError: "'NoneType' object has no attribute 'path'" in <function _remove at 0x7fb39c57a6e0> ignored

Javascript example

Hi

I've built the library for node.js, but it's not obvious how to call it (parameters, callbacks, etc.). Would it be possible to add a JavaScript call example to the readme?

Thanks

Ian

Could you explain how to define the perplexity?

I found that if I define the perplexity smaller than 0, K is always 0, because you define int K = (float) perplexity * 3.

If I define the perplexity > 0, I get a segmentation fault (because sizeof(distances) != K).

I'd like to know: can this source code still be used, or is it no longer maintained?

Is there a rule of thumb for the lower bound on the perplexity?

Dear Dr. van der Maaten:

Could you help me enhance my understanding of how the perplexity parameter works? There are two questions.

  1. Looking at the implementation, do I get it right that a reasonable upper bound on the perplexity is 1/3 of the minimal expected cluster size (for simplicity, assume we know what cluster sizes to expect)?

  2. On your home page, there is a question (“I get a strange ‘ball’ with uniformly distributed points”) and your suggestion is to reduce the perplexity. Do you think the same “ball” effect can be seen when the perplexity is too low? If yes, how do you suggest we define a lower bound for the perplexity?

Regarding 2), I have a digit-images data set with 40,000 points that is supposed to contain 10 clusters of about the same size. When I subsample 2,000 points and run default Rtsne (its implementation is very similar to yours), the embedding looks nice. However, it is far worse on the full data set. I figured this was because the default perplexity of 30 was too low compared to the typical cluster size of 4,000, so I reset it to 30 * 20 = 600 and obtained a very nice embedding.

When the expected result is unknown, I guess one could use a similar subsampling approach to figure out how to increase the perplexity. I was wondering if you know of a more analytical method or a rule of thumb.

Regards,
Nik Tuzov, PhD

Usage of python wrapper

Hello!

I compiled the program and tried the MATLAB example, and it works.

But when I try the python wrapper, no bh_tsne.exe process starts and I get nothing in the output, so I don't know what the problem is. My system is Windows 7 x64.

Could you please provide a usage example for Python, similar to the one for MATLAB?

Thanks in advance

Unable to build with visual studio 2015

When I try to follow the Windows instructions in the readme, I get the following error:

cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /openmp tsne.obj sptree.obj -Fewindows\bh_tsne.exe
libcpmt.lib(xthrow.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1900' doesn't match value '1800' in tsne.obj
libucrt.lib(hypot.obj) : error LNK2005: hypot already defined in tsne.obj
windows\bh_tsne.exe : fatal error LNK1169: one or more multiply defined symbols found
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe"' : return code '0x2'
Stop.

I'm not sure how I can edit the tsne.obj file to fix this. Any advice?

Thanks!
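
A guess at the cause (my speculation, not from this thread): LNK2038 reporting _MSC_VER value '1900' versus '1800' usually means the linker picked up .obj files left over from an earlier Visual Studio 2013 build. If so, the fix is not to edit tsne.obj but to delete the stale object files and rebuild with the current toolchain:

del tsne.obj sptree.obj
nmake -f Makefile.win all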

Cannot use the python wrapper on Windows

I was trying to use the python wrapper on Windows with your example code, but this error was raised:

AttributeError: module 'os' has no attribute 'fork'

which seems reasonable on Windows. Do you have any suggestions for solving this problem? Thanks!

Why is the exact algorithm 10 times faster?

Hi, I was wondering about the relative speed of exact t-SNE and Barnes-Hut t-SNE in the C++ code.

In my case, exact t-SNE is almost 10 times faster than BH t-SNE, which does not make much sense theoretically. Has anyone encountered similar results?

If exact t-SNE really is faster in the C++ version, could anyone explain why this is the case? I'd really appreciate it!

Using bhtsne.py with a numpy array

Hello,

I found this implementation because sklearn's TSNE doesn't scale well with my 50k x 50k similarity matrix. Is there a simple way to pass this matrix in the same way it is passed in scikit-learn? Thanks.

Alternative metrics

It would be awesome if we could choose between several distance measures (e.g. Jaccard).

Support for Ubuntu 14.04 LTS?

Does this implementation support Ubuntu 14.04 LTS? Would it be possible to add clean documentation for this?

Help me! Thanks!

Help! I'm having trouble with HLLE.m: when I run the MNIST dataset through hlle, an error comes up at line 118, "[mappedX, eigenvals] = eigs(G, no_dims + 1, tol, options);":

Error in eigs (line 93)
[A,Amatrix,isrealprob,issymA,n,B,classAB,k,eigs_sigma,whch, ...
Error in hlle (line 118)
[mappedX, eigenvals] = eigs(G , no_dims + 1, tol, options);
Error in test_hlle_minst (line 8)
[mappedX] = hlle(train_X, 2, 12);

I would very much appreciate it if you could help me overcome these errors. Thank you!

Cannot read and pass on the results

I ran the python bh_tsne on a 95 x 745544 matrix; here is my command:

./bhtsne.py -i ~/Dropbox/github/data/lan_uid_matrix.txt -o ~/Dropbox/github/data/lan_uid_coordinate.txt -p 5 -d 2 -t 1 -v

but it shows the following error:

Error: could not open data file.
Traceback (most recent call last):
  File "./bhtsne.py", line 233, in <module>
    exit(main(argv))
  File "./bhtsne.py", line 224, in main
    verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
  File "./bhtsne.py", line 211, in run_bh_tsne
    for result in bh_tsne(tmp_dir_path, verbose):
  File "./bhtsne.py", line 164, in bh_tsne
    with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
IOError: [Errno 2] No such file or directory: '/var/folders/92/8ty0c6392m773r5tbp4s9gy80000gp/T/tmpakdFT0/result.dat'

I don't know why it cannot find result.dat. Could you help me solve this?

Thanks in advance

Running on multiple cores?

Hey,
It's more a question than an actual issue: I'm mapping a dataset of 32 dims x 900000 items with t-SNE on a multi-core machine, but as t-SNE is single-threaded I'm only using one core. Do you have any tips or tricks for splitting the dataset to parallelize the computation?
Thanks in advance!

Butterfly effect

Hello Laurens:

There is a sizeable difference in the output quantities even when the difference between the input data sets is virtually zero. The R code attached below provides an illustration, and a similar issue exists with your code as well. I know it's all about relative distances among the points, so as long as the visualizations look similar (which is the case) the user shouldn't care. Still, it would be nice to see more consistent numbers in the output when the input data are virtually the same.

Based on my tests, the divergence occurs in computeNonEdgeForces(), which causes computeGradient() to diverge as well.

Regards,
Nik Tuzov, PhD

===============================================

library(Rtsne)
library(rgl)

set.seed(115)
iris_unique <- unique(iris)
Y_zero <- as.matrix(iris_unique[, 1:3])
tsne_out3d <- Rtsne(as.matrix(iris_unique[, 1:4]), dims = 3, Y_init = Y_zero)
plot3d(tsne_out3d$Y[, 1], tsne_out3d$Y[, 2], tsne_out3d$Y[, 3], col = as.numeric(iris_unique$Species))
head(tsne_out3d$Y)

set.seed(115)
iris_unique_butt = iris_unique;
iris_unique_butt[1, 1] = iris_unique_butt[1, 1] + 1e-6;
tsne_out3d_butt <- Rtsne(as.matrix(iris_unique_butt[, 1:4]), dims = 3, Y_init = Y_zero)
plot3d(tsne_out3d_butt$Y[, 1], tsne_out3d_butt$Y[, 2], tsne_out3d_butt$Y[, 3], col = as.numeric(iris_unique$Species))
head(tsne_out3d_butt$Y)

Indexing bugs in tsne.cpp

Hello,
I have found some indexing bugs in the methods TSNE::computeSquaredEuclideanDistance and TSNE::run for the exact computation of t-SNE (in the section where you symmetrize the input probability matrix). If this has already been corrected, thank you; please let me know and I will make sure I obtain the updated version.

In both cases, two nested loops are used with a similar indexing pattern. I will use the Euclidean distance matrix calculation to describe what I see:

    nN = 0; nD = 0;
    for(int n = 0; n < N; n++) {
        int mD = 0; // bug: always indexes the first data point in X
        DD[nN + n] = 0.0;
        for(int m = n + 1; m < N; m++) {
            DD[nN + m] = 0.0;
            for(int d = 0; d < D; d++) {
                DD[nN + m] += (X[nD + d] - X[mD + d]) * (X[nD + d] - X[mD + d]);
            }
            DD[m * N + n] = DD[nN + m];
            mD += D;
        }
        nN += N;
        nD += D;
    }

The problem is that mD always starts at 0, but m always starts at n+1. It seems that what is intended is that mD should contain the index of the m'th data point in X. If we describe DD as a matrix instead of an array, when you process row 0, you end up calculating the following:

    DD[0,1] = ||x0 - x0||^2
    DD[0,2] = ||x0 - x1||^2
    DD[0,3] = ||x0 - x2||^2

where x0 is the first point, x1 is the second point, and x2 is the third point. In processing the second row, you get:

    DD[1,2] = ||x1 - x0||^2
    DD[1,3] = ||x1 - x1||^2
    DD[1,4] = ||x1 - x2||^2

This can be fixed by computing mD from m directly:

    nN = 0; nD = 0;
    for(int n = 0; n < N; n++) {
        DD[nN + n] = 0.0;
        for(int m = n + 1; m < N; m++) {
            int mD = m * D;   // the fix: index of the m'th data point in X
            DD[nN + m] = 0.0;
            for(int d = 0; d < D; d++) {
                DD[nN + m] += (X[nD + d] - X[mD + d]) * (X[nD + d] - X[mD + d]);
            }
            DD[m * N + n] = DD[nN + m];
        }
        nN += N;
        nD += D;
    }

The bug fix in TSNE::run is similar:

    // Symmetrize input similarities
    printf("Symmetrizing...\n");
    int nN = 0;
    for(int n = 0; n < N; n++) {
        for(int m = n + 1; m < N; m++) {
            int mN = m * N;   // the fix: was initialized to 0 and incremented by N
            P[nN + m] += P[mN + n];
            P[mN + n] = P[nN + m];
        }
        nN += N;
    }

As a last note, thank you for providing this technique and implementation. It has proved to be a great embedding technique for the type of data I work with.

Sincerely,

Allison

Dimension problem

Hello,
I'm trying to use the code on an image dataset stored in an h5 file, so I changed the load_data function in the python wrapper to this:

def load_data(input_file):
    with h5py.File('data4.h5', 'r') as hf:
        data = hf['data1'][:]
    return data

where data1 is the name of the dataset inside the h5 file. But I get the following error:

Traceback (most recent call last):
  File "bhtsne.py", line 246, in <module>
    exit(main(argv))
  File "bhtsne.py", line 237, in main
    verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
  File "bhtsne.py", line 208, in run_bh_tsne
    init_bh_tsne(data, tmp_dir_path, no_dims=no_dims, perplexity=perplexity, theta=theta, randseed=randseed,verbose=verbose, initial_dims=initial_dims, use_pca=use_pca, max_iter=max_iter)
  File "bhtsne.py", line 112, in init_bh_tsne
    cov_x = np.dot(np.transpose(samples), samples)
ValueError: shapes (32,32,3,2462) and (2462,3,32,32) not aligned: 2462 (dim 3) != 32 (dim 2)
Traceback (most recent call last):
  File "bhtsne.py", line 246, in <module>
    exit(main(argv))
  File "bhtsne.py", line 237, in main
    verbose=argp.verbose, initial_dims=argp.initial_dims, use_pca=argp.use_pca, max_iter=argp.max_iter):
  File "bhtsne.py", line 218, in run_bh_tsne
    for result in bh_tsne(tmp_dir_path, verbose):
  File "bhtsne.py", line 163, in bh_tsne
    with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpx2rn_uow/result.dat'

How can I read my dataset correctly?
If that's not possible: I still have the data in a folder structure where each folder represents a category. Sorry for my lack of knowledge, but can I (and should I) convert it to a CSV file instead?

Thank you very much

Pytorch version?

Hi,

Since t-SNE is increasingly used to visualize neural network outputs (and their layers), it would be extremely helpful to have an implementation of t-SNE in PyTorch, in particular the Barnes-Hut version that runs in N log N time.

Is this something you would be interested in doing?
Thanks!

Can I use a pairwise similarity matrix as input into bhtsne?

Hello,
Is there an equivalent of tsne_p.m for fast_tsne (i.e., a way to provide the pairwise similarity matrix directly)?

If not, would it be appropriate for me to add an option to computeSquaredEuclideanDistance for computing a Gaussian kernel with a custom DD (as in exp(-beta * DD[nN + m])), plus an equivalent vptree distance, or are Euclidean distances the only way for the fast version to work correctly?

Use of `os.fork` breaks Windows support

Changes in 5d347d1 break Windows support, as Windows doesn't support os.fork. Wish I could propose a fix, but I'm not sure if/how to achieve the same sort of functionality on Windows. As a stopgap, I suppose the use of the forked process could just be made conditional on the platform...

Different computation of symmetrized conditional probabilities

Apologies if this is not an "issue", but rather a question that I have about the implementation (or my lack of understanding thereof).

The paper says in section 3.1 (and in the pseudocode of Algorithm 1)

set p_ij = (p_{j|i} + p_{i|j}) / 2n

The actual implementation of the symmetrization in tsne.cpp, line 112 seems to be

double sum_P = .0;
for(int i = 0; i < N * N; i++) sum_P += P[i];
for(int i = 0; i < N * N; i++) P[i] /= sum_P;

thus not dividing by 2n, but by the sum of all elements.

Which one is right?

My gut feeling is: It does not matter. Both achieve the same goal. But then, I wonder why the effort of computing the sum is undertaken.

Am I overlooking something here?
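
For what it's worth, here is a quick check (my own derivation, so verify it): each conditional distribution sums to one, so after the symmetrization pass the total mass of the matrix is

    sum_{i,j} (p_{j|i} + p_{i|j}) = sum_i sum_j p_{j|i} + sum_j sum_i p_{i|j} = N + N = 2N

which means dividing by the computed sum is the same as dividing by 2n in exact arithmetic. The explicit sum would then just be a cheap safeguard against stored conditionals that do not sum exactly to one due to floating-point round-off.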

Order of samples

Does this implementation keep the original input order of the samples in the output?

There is no module called bhtsne.run_bh_tsne ???

I can install bhtsne using my vcvars32.bat.

But I can't run the example from the front page:

data = np.loadtxt("mnist2500_X.txt", skiprows=1)
embedding_array = bhtsne.run_bh_tsne(data, initial_dims=data.shape[1])

However, I can run bhtsne.tsne(data).

The question is: is bhtsne.tsne the same as bhtsne.run_bh_tsne above? Also, setting verbose=True in bhtsne.py doesn't produce the usual verbose text in my Spyder (python 2.7, anaconda) console.

Segmentation Fault: 11 on NaN data

First, thanks a ton for this research and the implementation! Here's one thing and another I've been using it for.

So, it's totally my fault that I fed NaNs to bh_tsne, but the crash was a little mysterious, so I thought I would post an issue in case someone else runs into the same thing. From the Python wrapper I see:

 - point 0 of 30583
Traceback (most recent call last):
  File "bh_tsne/bhtsne.py", line 176, in <module>
    exit(main(argv))
  File "bh_tsne/bhtsne.py", line 167, in main
    verbose=argp.verbose):
  File "bh_tsne/bhtsne.py", line 125, in bh_tsne
    'refer to the bh_tsne output for further details')
AssertionError: ERROR: Call to bh_tsne exited with a non-zero return code exit status, please refer to the bh_tsne output for further details

Then, copying data.dat and feeding it to the binary under a debugger, I see it crashes here with EXC_BAD_ACCESS (or Segmentation Fault: 11 when run without a debugger):

            // Compute Gaussian kernel row
            for(int m = 0; m < K; m++) cur_P[m] = exp(-beta * distances[m + 1]);

distances is 0-length, so I looked at my data more closely, sorting it and checking for duplicate or weird rows. I noticed some NaNs and verified it with this at the end of TSNE::load_data:

    int k = 0;
    for(int i = 0; i < *n; i++) {
        for(int j = 0; j < *d; j++) {
            if(isnan((*data)[k++])) {
                printf("Found NaN at %i x %i!\n", i, j);
            }
        }
    }

A quick hack to clean the data, grep -v 'nan' vectors > vectors-clean, and it looks like it's working; now I just need to fix the original cause of the problem :)

python wrapper - Cost for each sample

Hello,
I'm working with the python wrapper bhtsne.py, the iris dataset, and Ubuntu 14.04.
I'm trying to get the cost for each sample (as specified at the end of bh_tsne()).

I uncommented the last line of the function,

#read_unpack('{}d'.format(sample_count), output_file)

adapted it to

_read_unpack('{}d'.format(len(results)), output_file)

and put it before the yield, inside a simple print().

However, all the costs are equal to zero, even when I set a very low number of iterations and the verbose output tells me that the error is still high:

$ ./bhtsne.py -d 2 -p 30 -v -i iris_data.txt -o tsne_test.output --no_pca -m 100
Read the 150 x 4 data matrix successfully!
Using current time as random seed...
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
 - point 0 of 150
Input similarities computed in 0.01 seconds (sparsity = 0.706622)!
Learning embedding...
Iteration 50: error is 45.556438 (50 iterations in 0.02 seconds)
Iteration 99: error is 44.807590 (50 iterations in 0.02 seconds)
Fitting performed in 0.04 seconds.
Wrote the 150 x 2 data matrix successfully!
('costs: ', (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, [... all 0.0 ..], 0.0, 0.0))

What did I miss, or how can I get the costs?

Thanks

run_bh_tsne fails with FileNotFoundError in Python 3.5

Note that this works in Python 2.7, but not in Anaconda Python 3.5 on OS X or Linux. Something is wrong with the file handling; I can't figure out what, but this file does not exist.

It looks like it opens the file in read mode 'rb' and then writes to it? I'm not familiar with that pattern.

Traceback (most recent call last):
  File "test/test_tsne.py", line 24, in reduce_dimensions
    result = bhtsne.run_bh_tsne(pca_result)
  File "/Users/rjurney/Software/pinpointcloud_worker/bhtsne/bhtsne.py", line 214, in run_bh_tsne
    for result in bh_tsne(tmp_dir_path, verbose):
  File "/Users/rjurney/Software/pinpointcloud_worker/bhtsne/bhtsne.py", line 159, in bh_tsne
    with open(path_join(workdir, 'result.dat'), 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/0b/74l_65015_5fcbmbdz1w2xl40000gn/T/tmph9x08ku8/result.dat'

Executable file name wrong after compilation

After compiling, I get the file bhtsne, but it should be bh_tsne, or at least that is what the python wrapper expects in my case. I am on OS X 10.12.3 and I am compiling using g++ from Xcode.

Bhtsne for large datasets

Hello @lvdmaaten,

I've read on your t-SNE homepage that you can handle datasets with up to 30 million examples (https://lvdmaaten.github.io/tsne/). I'm currently working in Google Colab.

I have a dataset with 2 million examples, and each example is a 100-d vector.
Using verbose=False, I get the following:

[screenshot]

Using verbose=True as suggested, I get:

[screenshot]

I'm not sure what this means or how I should proceed. The example with the MNIST dataset works perfectly using verbose=False.

bhtsne.py:135: ComplexWarning: Casting complex values to real discards the imaginary part

First, I compiled bhtsne successfully. Then I ran the example code, using the data file 'mnist2500_X.txt':

python bhtsne.py -i mnist2500_x.txt

I get this warning:

bhtsne.py:135: ComplexWarning: Casting complex values to real discards the imaginary part

The warning occurs when writing 'data.dat'; complex values are found after PCA. I don't know how to fix it. Any suggestions will be appreciated.

Can't compile the .exe with visual studio 9.0

Hey,
I'm trying to generate bhtsne.exe by following your instructions, but I keep getting this message in the cmd window:

sptree.cpp
sptree.cpp(111): error C3861: 'fmax': identifier not found
sptree.cpp(335): error C3861: 'fmax': identifier not found
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe"': return code '0x2'

Any idea how to fix it?

Transposition based on input method

[user@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0'
1.0	0.0
0.0	1.0
[user@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0' | ./bhtsne.py -d 2 -p 0.1
-2227.32653069	6608.48958328
2227.32653069	-6608.48958328
[user@login-node03 bhtsne]$ echo -e '1.0\t0.0\n0.0\t1.0' > a_file.txt
[user@login-node03 bhtsne]$ cat a_file.txt
1.0	0.0
0.0	1.0
[user@login-node03 bhtsne]$ ./bhtsne.py -d 2 -p 0.1 -i a_file.txt
-6863.21277159	-1236.73732294
6863.21277159	1236.73732294
