klugerlab / fit-sne

Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE)

License: Other

R 2.98% MATLAB 3.67% C++ 48.71% C 12.08% Python 3.28% Fortran 29.29%
t-sne visualization big-data fast-algorithm

fit-sne's Introduction

FFT-accelerated Interpolation-based t-SNE (FIt-SNE)

Introduction

t-distributed Stochastic Neighbor Embedding (t-SNE) is a highly successful method for dimensionality reduction and visualization of high-dimensional datasets. A popular implementation of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent. We accelerated this implementation as follows:

  • Computation of the N-body Simulation: Instead of approximating the N-body simulation using Barnes-Hut, we interpolate onto an equispaced grid and use the FFT to perform the convolution, dramatically reducing the time to compute the gradient at each iteration of gradient descent. See this post for some intuition on how it works, and the toy sketch just after this list.
  • Computation of Input Similarities: Instead of computing nearest neighbors using vantage-point trees, we approximate nearest neighbors using the Annoy library. The neighbor lookups are multithreaded to take advantage of machines with multiple cores. Using "near" neighbors as opposed to strictly "nearest" neighbors is faster, but also has a smoothing effect, which can be useful for embedding some datasets (see Linderman et al. (2017)). If subtle detail is required (e.g. in identifying small clusters), then use vantage-point trees (which is also multithreaded in this implementation).
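
To make the grid-interpolation idea concrete, here is a minimal one-dimensional numpy sketch of the FFT convolution step. This is an illustration only, not FIt-SNE's actual code: it uses nearest-node deposition where the real implementation spreads each point over several polynomial-interpolation nodes per grid interval, and it ignores the normalization terms of the true gradient.

import numpy as np

def fft_repulsive_field(y, n_grid=256):
    # Deposit point "charges" onto an equispaced grid, convolve with the
    # Cauchy (t-distribution) kernel via FFT, and read the field back at
    # each point's grid node. Cost is O(N + n_grid log n_grid), not O(N^2).
    lo, hi = y.min(), y.max()
    h = (hi - lo) / (n_grid - 1)
    idx = np.clip(np.round((y - lo) / h).astype(int), 0, n_grid - 1)
    charges = np.bincount(idx, minlength=n_grid).astype(float)
    # Zero-pad to length 2*n_grid so the circular FFT convolution below
    # equals the desired linear convolution with the symmetric kernel.
    m = 2 * n_grid
    lags = np.minimum(np.arange(m), m - np.arange(m)) * h
    kernel = 1.0 / (1.0 + lags ** 2)
    padded = np.concatenate([charges, np.zeros(n_grid)])
    field = np.real(np.fft.ifft(np.fft.fft(padded) * np.fft.fft(kernel)))
    return field[idx]

y = np.random.randn(70000)          # toy 1D "embedding" coordinates
rep = fft_repulsive_field(y)        # per-point repulsive field values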

Check out our paper or preprint for more details and some benchmarks.

Features

Additionally, this implementation includes the following features:

  • Early exaggeration: In Linderman and Steinerberger (2018), we showed that appropriately choosing the early exaggeration coefficient can lead to improved embedding of swissrolls and other synthetic datasets. Early exaggeration is built into all t-SNE implementations; here we highlight its importance as a parameter.
  • Late exaggeration: Increasing the exaggeration coefficient late in the optimization process can improve separation of the clusters. Kobak and Berens (2019) suggest starting late exaggeration immediately following early exaggeration.
  • Initialization: Custom initialization can be provided from Python/Matlab/R. As suggested by Kobak and Berens (2019), initializing t-SNE with the first two principal components (scaled to have standard deviation 0.0001) results in an embedding which often preserves the global structure more effectively than the default random initialization. See there for other initialization tricks.
  • Variable degrees of freedom: Kobak et al. (2019) show that decreasing the degree of freedom (df) of the t-distribution (resulting in heavier tails) reveals fine structure that is not visible in standard t-SNE.
  • Perplexity combination: The perplexity parameter determines the width of the Gaussian kernel, such that small perplexity values uncover local structure while larger values reveal global structure. Kobak and Berens (2019) show that using a combination of several perplexity values, resulting in a multi-scale embedding, can be useful.
  • All optimization parameters can be controlled from Python/Matlab/R. For example, Belkina et al. (2019) highlight the importance of increasing the learning rate when embedding large data sets. A combined usage sketch follows this list.
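
For concreteness, here is how several of these options look together when called from the Python wrapper. This is a sketch: the argument names follow fast_tsne.py as used elsewhere on this page (initialization, learning_rate, start_late_exag_iter, late_exag_coeff, df), but exact names and defaults may differ between versions, and the values shown are purely illustrative.

import numpy as np
from fast_tsne import fast_tsne

X50 = np.random.randn(70000, 50)         # stand-in for PCA-reduced data

# PCA-style initialization scaled to standard deviation 0.0001.
init = X50[:, :2] / np.std(X50[:, 0]) * 0.0001

Z = fast_tsne(
    X50,
    initialization=init,                 # custom initialization
    learning_rate=X50.shape[0] / 12,     # larger learning rate for large n
    start_late_exag_iter=750,            # begin late exaggeration here
    late_exag_coeff=2,                   # late exaggeration coefficient
    df=0.5,                              # heavier tails than standard t-SNE
)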

Installation

R, Matlab, and Python wrappers are fast_tsne.R, fast_tsne.m, and fast_tsne.py respectively. Each of these wrappers can be used after installing FFTW and compiling the C++ code, as below. Gioele La Manno implemented a Python (Cython) wrapper, which is available on PyPI here.

Note: If you update to a new version of FIt-SNE using git pull, be sure to recompile.

OSX and Linux

The only prerequisite is FFTW, which can be downloaded and installed from the website. Then, from the root directory compile the code as:

g++ -std=c++11 -O3  src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp  -o bin/fast_tsne -pthread -lfftw3 -lm -Wno-address-of-packed-member

See here for instructions in case one does not have sudo rights (one can install FFTW in the home directory and provide its path to g++).

Check out examples/ for usage. The Python demo notebook walks through most of the available options using the MNIST data set.

Windows

A Windows binary is available here. Please extract to the bin/ folder, and you should be all set.

If you would like to compile it yourself, see below. The code has currently been tested with MS Visual Studio 2015 (i.e., MS Visual Studio Version 14).

  1. First open the provided FItSNE solution (FItSNE.sln) using MS Visual Studio and build the Release configuration.
  2. Copy the binary file (e.g. x64/Release/FItSNE.exe) generated by the build process to the bin/ folder.
  3. For Windows, we have added all dependencies, including the FFTW library, which is distributed under the GNU General Public License. For the binary to find the FFTW DLLs, you need to either add src/winlibs/fftw/ to your PATH, or to copy the DLLs into the bin/ directory.

As of this commit, only the R wrapper properly calls the Windows executable. The Python and Matlab wrappers can be trivially changed to call it (just changing bin/fast_tsne to bin/FItSNE.exe in the code), and will be changed in future commits.

Many thanks to Josef Spidlen for this Windows implementation!

Acknowledgements and References

We are grateful for members of the community who have contributed to improving FIt-SNE, especially Dmitry Kobak, Pavlin Poličar, and Josef Spidlen.

If you use FIt-SNE, please cite:

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. (2019). Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods. (link)

Our implementation is derived from the Barnes-Hut implementation:

Laurens van der Maaten (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245. (link)

fit-sne's People

Contributors

alec-hoyland, carmot, christophh, dillonhammill, dkobak, jlause, jlmelville, josefflowjo, junzhao1990, linqiaozhi, pavlin-policar, sg-s

fit-sne's Issues

license

Very nice work! Have you decided how to license this software?

Variable degree of freedom is not supported for 1D

Just realized that our variable degree of freedom is not supported for 1D visualizations. Let me tell you why it could be cool to implement when we talk later today :-)

By the way, here is the 1D MNIST t-SNE that I am getting with my default settings:
fast_tsne(X50, learning_rate=X50.shape[0]/12, map_dims=1, initialization=PCAinit[:,0])
where X50 is the 70000x50 matrix after PCA:

[figure: 1D MNIST t-SNE embedding]

It's beautiful, but the 4s and 9s (salmon/violet) get tangled up. I played a bit with the parameters but couldn't make them separate fully. It seems 1D optimisation is harder than 2D (which is no surprise).

Missing nthreads parameter in python wrapper

I've been looking at the python wrapper fast_tsne.py and as far as I can tell, there's no way to specify the number of threads. Looking at the code, it's also not clear to me how to implement this.

I am aware of pyFIt-SNE, where I can specify the number of threads, but it doesn't have the nice perplexity list that Dmitry implemented.
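
For reference, the compiled binary itself does accept a thread count on the command line (see the "fast_tsne nthreads: ..." lines in the program output quoted in other issues here), so one possible patch is to thread the parameter through to the subprocess call. A hypothetical Python sketch, assuming the positional argument order (data_path, result_path, version, nthreads) that the R wrapper uses:

import subprocess

def call_fast_tsne_binary(data_path, result_path, version="1.1.0",
                          nthreads=8, binary="bin/fast_tsne"):
    # Argument order mirrors the corrected R wrapper call discussed in
    # another issue on this page; it is an assumption, not a verified API.
    subprocess.check_call([binary, data_path, result_path, version,
                           str(nthreads)])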

Higher dimensional embeddings

First, thanks for the great package, and for making it easy to use! I am interested in embedding into three dimensions or higher. I'm using the Python API from the latest version on PyPI (0.2.5), and it appears to support this via the no_dims parameter. The resulting array has the correct shape (as specified by the value of no_dims), but the results appear to be essentially random when visualized (either in 2D slices, or in 3D). I couldn't find any documentation on this, and was hoping you could clarify the parameter/usage for embedding into higher-dimensional spaces. Thanks in advance.

Loading val_P/row_P/col_P from files: unused part of the code?

In the Annoy function, there is a large block of code that loads val_P/row_P/col_P vectors from files if they exist: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L960. However, the code to save these vectors to the file is commented out: https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L1161. So all of that does not seem to be doing anything.

Is this a remnant of some debugging? If so, can it perhaps be deleted to simplify the code? Or is it a feature that is not yet fully implemented?

'bind' is not a member of 'std'

When compiling with gcc 7.1.0, I ran into this error message.

src/tsne.cpp: In member function 'int TSNE::computeGaussianPerplexity(double*, int, int, unsigned int**, unsigned int**, double**, double, int, double, int, int, unsigned int)':
src/tsne.cpp:1058:35: error: 'bind' is not a member of 'std'
     threads[t] = std::thread(std::bind(
                                   ^~~~
src/tsne.cpp:1058:35: note: suggested alternative: 'find'
     threads[t] = std::thread(std::bind(
                                   ^~~~
                                   find
src/tsne.cpp: In member function 'void TSNE::computeGaussianPerplexity(double*, int, int, unsigned int**, unsigned int**, double**, double, int, double, unsigned int)':
src/tsne.cpp:1222:34: error: 'bind' is not a member of 'std'
    threads[t] = std::thread(std::bind(
                                  ^~~~
src/tsne.cpp:1222:34: note: suggested alternative: 'find'
    threads[t] = std::thread(std::bind(
                                  ^~~~
                                  find

After some googling, I found that this has to do with the functional header. See the section "Header dependency changes" here.

I added #include <functional> to tsne.cpp and was able to successfully compile.

3D support

FIt-SNE currently only supports 1D and 2D reduction; will future versions support 3D reduction? 3D reduction is also appealing.

CMake configuration is hard coded

Hi, I am trying to explore the tool for embedding large datasets (300K observations, 1024 dimensions). I checked out the code and modified the Makefile to remove hardcoded paths to the developer's home directory. However, after that cmake reports an error:

(general) ✘-255 ~/src/FIt-SNE [master|✚ 1] 
11:03 $ make
CMake Error: The source directory "/Users/guha/src/FIt-SNE" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.

Do you know how this could be fixed?

How to properly pass the perplexity_list parameter?

Hi there! Thank you for this awesome implementation of tSNE!

I would like to know how to use the perplexity_list parameter, since there is no further documentation on it. Is it possible to get multiple perplexities for the same tSNE plot? If so, does the order of the values in the list represent anything to the algorithm?
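
For what it's worth, a hedged usage sketch (whether the scalar perplexity argument must also be unset or zeroed may depend on the wrapper version): the affinities computed at each listed perplexity are combined into a single multi-scale embedding, so the order of the values should not carry any meaning.

import numpy as np
from fast_tsne import fast_tsne

X50 = np.random.randn(10000, 50)    # stand-in data
# One call, one embedding: affinities at perplexities 30 and 300 combined.
Z = fast_tsne(X50, perplexity_list=[30, 300])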

Installation problems on Linux

Hi, I had some difficulties installing FIt-SNE on our Linux server, which I thought I would mention here in case others have the same problems.

First, I installed FFTW from source, using the source tarball from the provided link. I installed it into my home directory using the standard commands ./configure --prefix=$HOME, make, and make install.

Then, I cloned the FIt-SNE GitHub repository and tried running the g++ compiler command provided on the readme page. However, this gave the following errors:

In file included from src/tsne.cpp:39:0:
src/nbodyfft.h:7:19: fatal error: fftw3.h: No such file or directory
compilation terminated.
In file included from src/nbodyfft.cpp:3:0:
src/nbodyfft.h:7:19: fatal error: fftw3.h: No such file or directory
compilation terminated.

and

/usr/bin/ld: cannot find -lfftw3
collect2: error: ld returned 1 exit status

I managed to fix this by adding extra arguments to specify the include and lib directories (where the FFTW files were installed) in the g++ call, as follows:

g++ -std=c++11 -O3 src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp -I<path_to_include_directory> -L<path_to_lib_directory> -o bin/fast_tsne -pthread -lfftw3 -lm

I think it would be helpful to include a few comments on this on the readme page, for users who do not have a lot of experience using compilers.

FIt-SNE with MATLAB wrapper significantly slower than built-in implementation?

Hi,

I'm playing around with the fast_tsne wrapper, trying to embed an N=5000 test dataset:

X = [randn(5e3,1) randn(5e3,1)];
X(1:3:end,1) = X(1:3:end,1)+10;
X(2:3:end,2) = X(2:3:end,2)+10;
 plot(X(:,1),X(:,2),'k.')

[screenshot: scatter plot of the test data]

I can embed it using fast_tsne as follows:

opts.nthreads = 8;
R = fast_tsne(X,opts);
Elapsed time is 39.793197 seconds.

This is considerably slower than MATLAB's built-in implementation (which uses bhtsne). The number of iterations is 1000 in both.

tic; R = tsne(X); toc
Elapsed time is 26.548633 seconds.

What I did notice is that the fast_tsne binary never used more than a single core. Is this expected behaviour? Am I doing something wrong? I don't understand why it's slower than expected.

segfault: invalid permissions

After installing the current version, using R 3.5.0 and Fedora 28, the example code in examples/test.R runs fine. However, when attempting to analyze a larger dataset (a matrix with dimensions 29759 x 33650), the following error results:

*** caught segfault ***
address 0x7fd57089d000, cause 'invalid permissions'

Traceback:
1: writeBin(tX, f)
2: fftRtsne(X)
An irrecoverable exception occurred. R is aborting now ...

Any idea what could be the problem? RAM usage at the point of failure is about 59%, so it does not seem to be an out-of-memory error. Running R as root did not solve the problem either.

KNN algorithm when using sigma and K as inputs

When input parameters sigma and K are provided and theta>0 is used, then the code always uses VP trees, independent of what knn_algo is set to. Is this intended? I could fix it, or am I missing something here?

See https://github.com/KlugerLab/FIt-SNE/blob/master/src/tsne.cpp#L182, even though the ANNOY version of the computeGaussianPerplexity() function does support sigma as input.

By the way, this option is an interesting addition to the BH t-SNE codebase. I have also been experimenting with fixed sigmas to better understand the influence of the perplexity choice on t-SNE performance.

As an aside, I noticed that using small K and large sigma (something like sigma=100 and K=10), which effectively yields an almost binary affinity matrix, often gives quite good results, similar to perplexity=10. This becomes something like a k-nearest-neighbours graph embedding algorithm...
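
A sketch of such a call through the Python wrapper (sigma and K are the fixed bandwidth and neighbor count discussed above; exact wrapper argument names may vary between versions):

import numpy as np
from fast_tsne import fast_tsne

X = np.random.randn(10000, 50)      # stand-in data
# Large sigma + small K: nearly binary affinities, akin to embedding the
# kNN graph itself.
Z = fast_tsne(X, sigma=100, K=10)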

Error when loading similarity matrix

The speed of FIt-SNE is amazing. Can FIt-SNE accept a precomputed similarity matrix? I used the Matlab wrapper and passed a precomputed similarity matrix with the "load" option, but errors were reported saying the file could not be found.

Default learning rate

Embedding a large dataset (let's say n = 1 million) with FIt-SNE using default parameters will yield a horrible result. Now that we understand it pretty well, and now that there are published papers suggesting a very easy fix, why don't we change the default learning rate? What's the benefit of keeping it set to 200, as it was in the original t-SNE implementation?

My suggestion: if n>=10000 and if the learning rate is not explicitly set, then the wrapper sets it to n/12. The cutoff can be smaller than 10000 but in my experience smaller data sets work fine with learning rate 200, and 10000 is a nice round number.

@pavlin-policar I suggest adopting the same convention in openTSNE. What do you guys think?
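
A sketch of the proposed wrapper-side logic (the n >= 10000 cutoff and the n/12 value are the suggestion above, not merged behavior):

def resolve_learning_rate(n, learning_rate=None):
    # Respect an explicitly set learning rate; otherwise use n/12 for
    # large datasets and keep the historical default of 200 for small ones.
    if learning_rate is not None:
        return learning_rate
    return n / 12 if n >= 10000 else 200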

Build problems on Windows

Greetings,

When I try to build FIt-SNE using either VS 2015 or VS 2017, I get a host of errors and warnings:

1>------ Build started: Project: FItSNE, Configuration: Debug x64 ------
1>  stdafx.cpp
1>  mman.c
1>  tsne.cpp
1>c:\users\tantr\workplace\fit-sne\src\progress_bar\progressbar.hpp(35): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(110): warning C4244: 'argument': conversion from 'time_t' to 'unsigned int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(298): warning C4244: '=': conversion from 'double' to 'int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(300): warning C4244: '=': conversion from 'double' to 'int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(303): warning C4244: '=': conversion from 'double' to 'int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(333): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(334): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(372): error C2065: 'CLOCK_MONOTONIC': undeclared identifier
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(372): error C3861: 'clock_gettime': identifier not found
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(388): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(403): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(417): error C2065: 'CLOCK_MONOTONIC': undeclared identifier
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(417): error C3861: 'clock_gettime': identifier not found
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(487): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(492): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(498): error C2065: 'CLOCK_MONOTONIC': undeclared identifier
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(498): error C3861: 'clock_gettime': identifier not found
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(514): error C2065: 'CLOCK_MONOTONIC': undeclared identifier
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(514): error C3861: 'clock_gettime': identifier not found
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(188): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(228): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(251): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(266): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(273): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(343): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(350): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(357): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(512): warning C4477: 'printf' : format string '%ld' requires an argument of type 'long', but variadic argument 2 has type 'time_t'
1>  c:\users\tantr\workplace\fit-sne\src\tsne.cpp(512): note: consider using '%lld' in the format string
1>  c:\users\tantr\workplace\fit-sne\src\tsne.cpp(512): note: consider using '%Id' in the format string
1>  c:\users\tantr\workplace\fit-sne\src\tsne.cpp(512): note: consider using '%I64d' in the format string
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(557): error C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(1769): note: see declaration of 'sprintf'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(558): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(594): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(608): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(635): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(654): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(689): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(722): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(737): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(766): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(787): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(832): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(884): error C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(1769): note: see declaration of 'sprintf'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(885): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1006): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1037): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1234): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1235): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1250): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1280): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1341): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1368): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1407): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1411): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1453): warning C4018: '<=': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1471): warning C4018: '<=': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1570): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>c:\users\tantr\workplace\fit-sne\src\tsne.cpp(1671): error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
1>  c:\program files (x86)\windows kits\10\include\10.0.10240.0\ucrt\stdio.h(205): note: see declaration of 'fopen'
1>  sptree.cpp
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(53): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(54): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(81): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(106): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(110): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(293): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(313): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(322): warning C4244: '=': conversion from 'double' to 'int', possible loss of data
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(366): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(392): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(394): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(396): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(401): warning C4018: '<': signed/unsigned mismatch
1>c:\users\tantr\workplace\fit-sne\src\sptree.cpp(403): warning C4018: '<': signed/unsigned mismatch
1>  nbodyfft.cpp
1>  Generating Code...
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Any ideas by chance on how to fix the errors? Thank you in advance.

Allow providing an initialization

There is currently no way to provide a particular initialization, even though TSNE::run() seemingly supports this option via skip_random_init=TRUE. The relevant code in main() is commented out. It would be great to add this option.

One approach would be to modify the input file format such that, if the end of file is not reached when rand_seed is read, the remaining part of the file is taken as the initialization Y. This would allow using the same file format whether or not one provides an initialization.

(For my own purposes, I have previously modified bh_tsne to work like that.)
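
A sketch of the proposed convention from the wrapper side (the surrounding field layout is elided; only the optional trailing block matters here, and the helper name is hypothetical):

import numpy as np

def write_optional_initialization(f, Y=None):
    # ...all existing fields, ending with rand_seed, are written above...
    # If an initialization is given, append it as a raw N x no_dims float64
    # block; the reader uses it iff bytes remain after rand_seed.
    if Y is not None:
        f.write(np.ascontiguousarray(Y, dtype=np.float64).tobytes())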

Speed issues/questions

I noticed two issues with speed. I am working with a dataset with n=14k and dimensionality 50.

  1. The gradient descent iterations become slower and slower as they progress. This is most pronounced with low perplexities. E.g. with perplexity=10, I get 50 iterations in 0.634327 seconds in the beginning, but it slowly increases to 5-6 seconds per 50 iters towards iteration 1000. Why the slowdown?

    This is less pronounced for larger perplexities. For perpl=100, it grows from 1.3 s to 2-3 s. For perpl=1000, it does not noticeably grow at all (stays at around 9.5 s per 50 iters).

  2. For very large perplexity values (e.g. 1000), the code spends a lot of time after it computes all affinities via Annoy. What I mean is that

 Building Annoy tree...
Done building Annoy tree. Begin nearest neighbor search... 
Calculating dynamic kernels using perplexity 
parallel (8 threads):
 - point 0 of 14177, most recent beta calculated is 15.177124 
 - point 10000 of 14177, most recent beta calculated is 35.029297 

this part works in several seconds (maybe 10), but then it takes around 2 minutes until the output continues with

Using the given initialization
Exagerating Ps by 12.000000
Input similarities computed (sparsity = 0.263060)!
Learning embedding...
Using FIt-SNE approximation.
...

This is approximately as much time as is taken by the 1000 gradient descent iterations. What happens during this seeming "pause"? Is the symmetrizeMatrix() taking so much time for larger perplexity values?

Tidy up the terminal output before gradient descent begins

The terminal output before the gradient descent is a bit of a mess and always annoys me slightly :-)

fast_tsne data_path: data.dat
fast_tsne result_path: result.dat
fast_tsne nthreads: 8
Read the following parameters:
	 n 70000 by d 50 dataset, theta 0.500000,
	 perplexity 50.000000, no_dims 2, max_iter 1000,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 10000.000000, K -1, sigma -1.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
	 start_late_exag_iter -1, late_exag_coeff -1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 0.500000
Read the 70000 x 50 data matrix successfully. X[0,0] = 0.479432
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
Going to allocate memory. N: 70000, K: 150, N*K = 10500000
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (8 threads):
[===========================================================>] 99% 10.518s
Symmetrizing...
Using the given initialization.
Exaggerating Ps by 12.000000
Input similarities computed (sparsity = 0.002962)!
Learning embedding...
Using FIt-SNE approximation.
Iteration 50 (50 iterations in 3.58 seconds), cost 6.430274
...

It would be nice to tidy this up a little to make it a little more user-friendly.

  1. The list of parameters is ugly, this can be formatted better.
  2. "Will use momentum during exaggeration phase" -- this is the default so maybe rather say nothing unless momentum is not used?
  3. "Using perplexity, not the manually set kernel width. K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored." -- again the default, so maybe rather not print this
  4. Would be nice to see the time spent on building Annoy tree once its built.
  5. "parallel (8 threads):" -- this is printed above already, maybe don't need to repeat?
  6. "Symmetrizing..." -- would definitely be nice to see the time spent, because this can take quite some time in some cases.
  7. "Input similarities computed (sparsity = 0.002962)!" -- this should be printed after symmetrizing is done.
  8. If no exaggeration is used, no need to print "exaggerating by 1"
  9. In general, line breaks, "...", etc. can be used more consistently; and so on.

Windows binary, where to install

For the Windows binary you provide, you state the following:
"A Windows binary is available here. Please extract to the bin/ folder, and you should be all set."

Forgive my ignorance, but I still don't understand where to put the .exe and .dll files. That is to say, where should I put them (relative to my R installation directory) so that R will find them?

How to make clusters more compact?

Dear all,

first of all, thanks for the great tool! I am currently developing upon it and it has been really nice to use so far.

When comparing it to "classical" t-SNE visualizations, I found the classic visualization (i.e., the 2D embedded coordinates) to be somewhat more compact. Put differently, the clusters were more strongly separated.
This, in turn, is relevant as I would like to cluster the 2D embedded data.

Could you please let me know if, and if so how, this could be approached using FIt-SNE?
I checked the documentation and it seems that the late exaggeration parameter might be useful here.
However, I have no intuition about what a good coefficient to start with would be, or at which iteration to start.

Your thoughts and input are greatly appreciated!

Best,

Cedric

P.S. I am currently using v.1.1.0 and the python wrapper.

Error about dimension

When I run test.ipynb, I encounter an error:

ValueError: 'c' argument has 70000 elements, which is not acceptable for use with 'x' with size 25000, 'y' with size 25000.

Then I ran other data and encountered the same error: no matter how many cells the initial data contains, after fast_tsne, Z.shape is (25000, 50).

Syntax issue with setup.py

It looks like there is a syntax issue in the setup.py file for installing the package via pip (PyPI). That is, if you try to install the most recent version of the package via pip install fitsne, you get the following error message:

 Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/b3/j6sb_hk15t710xrbs7x3fvsc0000gn/T/pip-install-qebngw5i/fitsne/setup.py", line 16
        download_url=f"https://github.com/KlugerLab/pyFIt-SNE/archive/{__version__}.tar.gz"
                                                                                        ^
    SyntaxError: invalid syntax

The f-string syntax used here is not supported before Python 3.6, so on older versions setup.py fails and the version number is never substituted. You need to write it as:

download_url="https://github.com/KlugerLab/pyFIt-SNE/archive/" + __version__ + ".tar.gz"

Handling of `nthreads`

I am a bit confused by how the nthreads parameter is treated.

First, what's the reason for passing it as an argument to main() instead of writing it into a file?

Second, if no parameter is passed, then nthreads=0 by default. This is clearly unusable, so in this case the code does

    if (nthreads == 0) {
        nthreads = std::thread::hardware_concurrency();
    }

but this is done separately in each function that uses nthreads as opposed to in main(). Why?

Third, the above snippet is not included in evaluateError and evaluateErrorFft; is that just a mistake? Should we maybe move that snippet into main() and out of the individual functions?

Also, fourth, I noticed that there are several parallel loops that do not use PARALLEL_FOR; is that on purpose?

Result.dat

Hello

Thanks for providing the FIt-SNE package. My question: when I run fast_tsne in Python, I cannot find the result.dat file in the folder, while I do have data.dat, which is not empty.

Can you provide some suggestions?

Best
He Zhao

No multithreading after running RunTSNE with "test.method="FIt-SNE", nthreads = 8"

Hi,
I am using Seurat v3 and FIt-SNE v1.1.0, but it seems that only one thread is used when running from Seurat:

seurat_object <- RunTSNE(seurat_object, dims = 1:40, test.method="FIt-SNE", nthreads = 4)

Here is a reproducible example

library(Seurat)
download.file("https://www.dropbox.com/s/63gnlw45jf7cje8/pbmc3k_final.rds?dl=1", "~/Downloads/pbmc3k_final.rds")
pbmc <- readRDS(file = "~/Downloads/pbmc3k_final.rds")
pbmc <- RunTSNE(pbmc, dims = 1:40, test.method="FIt-SNE", nthreads = 4)

Any help?
Thanks

Runtime of the cost computation

While running the code on the n = 1.3 million dataset, I noticed that the CPU usage looks like this:

[screenshot: CPU usage over time during the run]

There are exactly 50 short valleys (each valley takes ~1s) with short periods of 100% CPU for all cores (also ~1s) -- this I assume corresponds to the attractive force computation (parallel on all cores) and the repulsive force computation (currently not parallel). Together they take ~2s per iteration, i.e. ~100s per 50 iterations. But after 50 iterations there is a long ~30s period when something happens on one core. WHAT IS IT?!

The only thing that is done every 50 iterations is the cost computation. Does the cost computation really take so long? Can it be accelerated? If not, and if it really takes 25% of the computation time, then I think it should be optional and switched off by default (at least for large datasets).

"Wrapper passed wrong version number" error (Seurat v2)

Hello,

I tried to install and run fast_tsne in the bin folder, but it fails every time, saying the following:
=============== t-SNE v1.1.0 ===============
Wrapper passed wrong version number: --version
Could you help me understand why this could be happening? I am using macOS and compiled FFTW from source without any errors.

Thank you
Asma

unable to reproduce example

While trying to reproduce your example code in Atom, the result of the first figure looks nothing like yours. I don't know where I might be changing things. I'm doing this on Windows. Here is what my code looks like:

import numpy as np
import sys
import os
import pylab as plt
sys.path.append('path to /FIt-SNE')
from fast_tsne import fast_tsne

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float64') / 255
x_test = x_test.reshape(10000, 784).astype('float64') / 255
X = np.concatenate((x_train, x_test))
y = np.concatenate((y_train, y_test))
print(X.shape)
X = X - X.mean(axis=0)
U, s, V = np.linalg.svd(X, full_matrices=False)
X50 = np.dot(U, np.diag(s))[:, :50]

print(X50.shape)

PCAinit = X50[:, :2] / np.std(X50[:, 0]) * 0.0001
col = np.array(['#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99',
'#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a'])

Z = fast_tsne(X50, perplexity=50, seed=42)

print(Z.shape)

plt.figure(figsize=(5, 5))
plt.scatter(Z[:, 0], Z[:, 1], c=col[y], s=1)
plt.tight_layout()
plt.show()

[figure: resulting embedding]

Interpolation times fluctuating

When running MNIST with df=.5, lr=1000, max_iter=10000, I noticed that the interpolation times fluctuate quite wildly:

Iteration 8900 (50 iterations in 156.69 seconds), cost 3.635826
Iteration 8950 (50 iterations in 144.49 seconds), cost 3.633185
Iteration 9000 (50 iterations in 167.69 seconds), cost 3.634107
Iteration 9050 (50 iterations in 216.43 seconds), cost 3.632257
Iteration 9100 (50 iterations in 98.15 seconds), cost 3.632917
Iteration 9150 (50 iterations in 87.34 seconds), cost 3.630620
Iteration 9200 (50 iterations in 115.28 seconds), cost 3.629794
Iteration 9250 (50 iterations in 213.32 seconds), cost 3.631249

This does not happen early in the optimization. The runtime grows exactly monotonically until it reaches about 10 seconds per 50 iterations (somewhere around iteration 1200). After that it begins to fluctuate, and does so more and more.

Any idea why this might be happening?

Parameter order in fast_tsne.R is wrong

Hi, thanks for your nice tool.
When I test it with examples/test.R, I found something wrong:

fast_tsne data_path: 1.1.0
fast_tsne result_path: C:\Users\ADMINI~1\AppData\Local\Temp\RtmpqkV3Br\test_data_1e5068e340bd.dat
fast_tsne nthreads: 12
Error: could not open data file.
Done.

The output above indicates that data_path is receiving the version number. I think the code at https://github.com/KlugerLab/FIt-SNE/blob/master/fast_tsne.R#L199 should be flag = system2(command=fast_tsne_path, args=c(data_path, result_path, version_number, nthreads)). This change makes the program run correctly.

anaconda does not recognize installation

I have followed the installation instructions as outlined in the README file, and it all seems to have worked without errors. However, when I try to load the library in Python, I get the error "ModuleNotFoundError: No module named 'fast_tsne'". Is there a trick to make my Anaconda Python environment find the installation?

As an alternative, I installed pyFIt-SNE (https://github.com/KlugerLab/pyFIt-SNE) directly through Anaconda, but that library does not seem to take the same commands as the version in this repository.

Undefined reference to `clock_gettime' when installing

Hi,

I got the following error when installing:

$ g++ -std=c++11 -O3  src/sptree.cpp src/tsne.cpp src/nbodyfft.cpp  -o bin/fast_tsne -pthread -lfftw3 -lm
/tmp/cczspn62.o: In function `TSNE::run(double*, int, int, double*, int, double, double, int, bool, int, int, int, int, double, int, int, double, double*, double*, bool, int, double, int, int)':
tsne.cpp:(.text+0x8698): undefined reference to `clock_gettime'
tsne.cpp:(.text+0x8c8b): undefined reference to `clock_gettime'
tsne.cpp:(.text+0x9be8): undefined reference to `clock_gettime'
collect2: error: ld returned 1 exit status

You can fix that by adding -lrt during compilation.

Precomputed distance matrix (take 2)

Hi, I had previously asked about using FIt-SNE with a precomputed distance matrix and I was pointed to an earlier version which did support it. Thanks for that, but I now want something else.

I see that there is now a "degrees of freedom" parameter that makes the heavy tails heavier, and I'm interested in using it if possible. I wonder if I can do that with a precomputed distance matrix.

Thanks for any help!

Changing output despite using the same initialization

I noticed that in some cases I get slightly different output if I rerun the exact same code, even though I provide the same initialization. I am wondering where the additional randomness can come from; maybe it comes from Annoy. If so, I think Annoy allows setting the seed.

Should we pass the rand_seed parameter (that we have anyway) to Annoy?
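
For what it's worth, Annoy's index exposes a set_seed method in both its C++ and Python interfaces, so passing rand_seed through would look roughly like this Python sketch (the wiring into FIt-SNE itself is the assumption here):

import numpy as np
from annoy import AnnoyIndex

X = np.random.randn(1000, 50)       # stand-in data
index = AnnoyIndex(X.shape[1], "euclidean")
index.set_seed(42)                  # would receive FIt-SNE's rand_seed
for i, row in enumerate(X):
    index.add_item(i, row)
index.build(50)                     # 50 trees, FIt-SNE's default n_trees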

annoylib does not compile using g++ 9

Just tried installing FIt-SNE on a new laptop with a fresh Ubuntu installation, and the compilation failed with some errors in annoylib. Googling led to this issue: spotify/annoy#310. Downloading the current annoylib.h fixed the problem. Are you OK with updating the version of Annoy included in FIt-SNE to the latest one? Then I can do a PR.

Pass on theta

Typo in here:

writeBin( as.numeric(0.5), f,size= 8) #theta

Should be
writeBin( as.numeric(theta), f,size= 8) #theta
I think.

Tried to fix via a pull request, but got a permission denied.

Josef

Default value of stop_lying_iter

In http://jmlr.org/papers/volume15/vandermaaten14a/vandermaaten14a.pdf, early exaggeration 12 is used for the first 250 iterations. This is also the (unchangeable) default in https://github.com/lvdmaaten/bhtsne/blob/master/tsne.h#L45. However, FIt-SNE's default in all three wrappers is 200.

Is there any particular reason to use 200? Otherwise I'd set the default to 250, for historical consistency.

PS. While checking the parameter values just now, I noticed that mom_switch_iter, momentum, final_momentum, and learning_rate are all hard-coded into the C++ code. Wouldn't it make sense to make them adjustable in the wrappers? It's easy to do, but would change the format of the input files (shouldn't be a big deal?). Not that I need to change any of these four parameters :-)

Export all iterations to be able to create animations

When preparing a public talk, I often want to include an animation of how t-SNE develops during gradient descent. I thought I could call fast_tsne in a loop with max_iter=1 (loading the similarities from a file), but this yields a worse final result than running fast_tsne in one go; I assume this is because of the adaptive gradient descent. So it would be great to be able to run fast_tsne once and export all max_iter iterations. What do you think?

Error: Invalid knn_algo param

Hi,

I'm trying to use FIt-SNE within Seurat and I'm following the instructions they provide.

The first issue is that the GitHub FIt-SNE fails to compile, complaining about the following error:

src/parallel_for.h:29:37: error: ‘bind’ is not a member of ‘std’

I solved this by adding #include <functional>, after which it compiled, and the binary seems to work as expected. However, when I try to run it via Seurat, I get the following error:

Invalid knn_algo param Error in fftRtsne(X = as.matrix(x = data.use), dims = dim.embed, rand_seed = seed.use, : tsne call failed

I have run several datasets and I always get the same error. Any ideas?

Report KL divergence during optimization

It would be really convenient if the optimization reported the current KL divergence (or at least the final one), like scikit-learn or Multicore t-SNE do. This could then be used later on, e.g. for automatically determining the perplexity (Cao, Wang 2017).

Implementing this should be trivial. We would need to add a logging for the error and then to actually compute the KL divergence, we'd need to add a few returns to the gradient code.

To make things convenient for the implementation, we can rewrite the KL divergence as

KL(P \| Q) = \sum_{ij} p_{ij} \log p_{ij} - \sum_{ij} p_{ij} \log \hat{q}_{ij} + \log \sum_{ij} \hat{q}_{ij}

where \hat{q}_{ij} denotes the unnormalized q_{ij}s. This is exactly what scikit-learn does (and Multicore t-SNE, if I remember correctly), so their code can be used as a reference if needed.

This is really easy to compute with the current implementation, because we can compute the first term during the main gradient loop, and if we keep a sum_P variable and return sum_Q from the negative gradient methods, we can easily compute the second term as well. Literally, everything is already in place for this to be implemented, a couple returns need to be added, and the actual expression written into code.

Note that during the early (and late) exaggeration phases the error will be incorrect, because P will be scaled, but it is also really easy to work out a correction for this:

KL(P \| Q) = \frac{1}{\alpha} \, KL(\alpha P \| Q) - \log \alpha

where \alpha is the exaggeration coefficient: the first term contains the KL divergence that we compute during optimization (with the exaggerated \alpha P), and the second term is the correction.
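
A numpy sketch of the computation described above (p and q_hat stand for flattened arrays of the pairwise affinities; in the real code the sums would come from the attractive and repulsive passes rather than from dense arrays):

import numpy as np

def kl_divergence(p, q_hat, exaggeration=1.0):
    # KL(P||Q) with q = q_hat / sum(q_hat); the final line undoes the
    # scaling if P is currently exaggerated by `exaggeration`.
    q = q_hat / q_hat.sum()
    mask = p > 0                    # p log p -> 0 as p -> 0
    kl_exag = np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return kl_exag / exaggeration - np.log(exaggeration)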

I can look into it when time permits unless somebody beats me to it :)

Lots of debugging files are created when using exact t-SNE

I noticed that when exact t-SNE is used (theta=0), the code creates a lot of debugging files in the temp/ folder. This is not the most pressing issue of course, but I was a bit confused to see 1000 files like exact_gradient0.txt in the temp/ when running the exact t-SNE.

This happens in computeExactGradient, which was modified relative to bhtsne to do exactly this. There is also computeExactGradientTest, which creates some similar (but not identical) files. However, computeExactGradientTest is never run, because bool measure_accuracy = false; is set globally.

I assume that computeExactGradientTest is some debugging function which is OK but I am wondering if the non-debugging version computeExactGradient can be reverted back to the bhtsne state without any files created in temp/.

If the answer is yes, then should the computeExactGradientTest be changed to save all the stuff that computeExactGradient is now saving?

Or maybe computeExactGradient can be modified to do the logging only if measure_accuracy is set to true?

Memory usage increasing non-linearly with iterations

Thanks for making this awesome implementation!

I am using the R wrapper of FIt-SNE on a Linux Mint 18 install.

I have successfully run the included examples, but I notice that once I start to run this package on my own data (a 90k x 42 dataframe converted to a matrix), I encounter memory issues.

With each set of 50 iterations the memory requirement increases by about 150 MB, and the time to calculate each iteration also increases. Below 1000 iterations this isn't much memory usage, but by the time I get up to 2500 iterations, FIt-SNE has used 32 GB of RAM and is munching through swap space just as rapidly.

Is this the expected behaviour? I could understand memory scaling with the size of the array or other options, but the number of iterations seems an odd thing for memory usage to be proportional to.

The exact R code used to call fitsne is:

newtsne <- fftRtsne(zdatanew, 
                    dims=2, perplexity=30, theta=0.5,
                    check_duplicates=TRUE,
                    max_iter=5000,
                    fft_not_bh = TRUE,
                    ann_not_vptree = TRUE,
                    stop_lying_iter=250,
                    exaggeration_factor=12.0, no_momentum_during_exag=FALSE,
                    start_late_exag_iter=-1.0,late_exag_coeff=1.0,
                    n_trees=50, search_k = -1,rand_seed=-1,
                    nterms=3, intervals_per_integer=1, min_num_intervals=50, 
                    data_path=NULL, result_path=NULL,
                    fast_tsne_path='/bin/fast_tsne', nthreads=0)

I'd really appreciate any help/insight you can give! I'm still seeing pinching on my plot at 2500 iterations, which I understand suggests I need more iterations.

Multithreading for gradient descent not working

A colleague of mine @msayhan downloaded the current master and is running t-SNE on our server under Python (24 cores; inside a docker container; whatever). It looks like the Annoy search is multithreaded (all 24 cores are working at 100% when t-SNE is started), but once Annoy finishes, only one core is active during the gradient descent.

Given that Annoy calls are multithreaded "manually" and the gradient descent is multithreaded using PARALLEL_FOR, I suspect that the latter does not work for him for some reason.

How can we troubleshoot that?

fast_tsne returns NaN for all x and y

I've been using the compiled version of fast_tsne for a while, but recently I updated and recompiled it and realized that it's not working for me anymore. Commit bd10ce4 is the problematic one: it saves only NaN in result.dat, and the cost in the output is also NaN:

============== t-SNE v1.0.1 ===============
fast_tsne data_path: data.dat
fast_tsne result_path: result.dat
fast_tsne nthreads: 4
Read the following parameters:
n 5989 by d 50 dataset, theta 0.500000,
perplexity 30.000000, no_dims 2, max_iter 1000,
stop_lying_iter 250, mom_switch_iter 250,
momentum 0.500000, final_momentum 0.800000,
learning_rate 200.000000, K -1, sigma -1.000000, nbody_algo 2,
knn_algo 1, early_exag_coeff 12.000000,
no_momentum_during_exag 0, n_trees 50, search_k 4500,
start_late_exag_iter -1, late_exag_coeff -1.000000
nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 0.000000
Read the 5989 x 50 data matrix successfully. X[0,0] = 0.335302
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width. K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 4500
Going to allocate memory. N: 5989, K: 90, N*K = 539010
Building Annoy tree...
Done building tree. Beginning nearest neighbor search...
parallel (4 threads):
[===========================================================>] 99% 0.844s
Symmetrizing...
Using current time as random seed...
Randomly initializing the solution.
Y[0] = -0.000211
Exaggerating Ps by 12.000000
Input similarities computed (sparsity = 0.023972)!
Learning embedding...
Using FIt-SNE approximation.
Iteration 50 (50 iterations in 0.80 seconds), cost nan
Iteration 100 (50 iterations in 0.77 seconds), cost nan
Iteration 150 (50 iterations in 0.78 seconds), cost nan
Iteration 200 (50 iterations in 0.77 seconds), cost nan
Iteration 250 (50 iterations in 0.77 seconds), cost nan
Unexaggerating Ps by 12.000000
Iteration 300 (50 iterations in 0.78 seconds), cost nan
Iteration 350 (50 iterations in 0.77 seconds), cost nan
Iteration 400 (50 iterations in 0.78 seconds), cost nan
Iteration 450 (50 iterations in 0.78 seconds), cost nan
Iteration 500 (50 iterations in 0.77 seconds), cost nan
Iteration 550 (50 iterations in 0.78 seconds), cost nan
Iteration 600 (50 iterations in 0.77 seconds), cost nan
Iteration 650 (50 iterations in 0.86 seconds), cost nan
Iteration 700 (50 iterations in 0.80 seconds), cost nan
Iteration 750 (50 iterations in 0.85 seconds), cost nan
Iteration 800 (50 iterations in 0.81 seconds), cost nan
Iteration 850 (50 iterations in 0.92 seconds), cost nan
Iteration 900 (50 iterations in 0.81 seconds), cost nan
Iteration 950 (50 iterations in 0.90 seconds), cost nan
Iteration 1000 (50 iterations in 0.78 seconds), cost nan
Wrote the 5989 x 2 data matrix successfully.
Done.

Before, I was using commit 4dd27e0 and everything was working fine, with exactly the same data.dat file:

=============== t-SNE ===============
fast_tsne data_path: data.dat
fast_tsne result_path: result.dat
fast_tsne nthreads: 4
Read the following parameters:
n 5989 by d 50 dataset, theta 0.500000,
perplexity 30.000000, no_dims 2, max_iter 1000,
stop_lying_iter 250, mom_switch_iter 250,
momentum 0.500000, final_momentum 0.800000,
learning_rate 200.000000, K -1, sigma -1.000000, nbody_algo 2,
knn_algo 1, early_exag_coeff 12.000000,
no_momentum_during_exag 0, n_trees 50, search_k 4500,
start_late_exag_iter -1, late_exag_coeff -1.000000
nterms 3, interval_per_integer 1.000000, min_num_intervals 50
Read the 5989 x 50 data matrix successfully. X[0,0] = 0.335302
Using current time as random seed...
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width. K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 4500
Going to allocate memory. N: 5989, K: 90, N*K = 539010
Building Annoy tree...
Done building tree. Beginning nearest neighbor search...
parallel (4 threads):
[============================================================] 100% 0.864s
Symmetrizing...
Randomly initializing the solution.
Exaggerating Ps by 12.000000
Input similarities computed (sparsity = 0.023972)!
Learning embedding...
Using FIt-SNE approximation.
Iteration 50 (50 iterations in 0.72 seconds), cost 4.998925
Iteration 100 (50 iterations in 0.77 seconds), cost 4.659932
Iteration 150 (50 iterations in 0.82 seconds), cost 4.645344
Iteration 200 (50 iterations in 0.76 seconds), cost 4.636345
Iteration 250 (50 iterations in 0.76 seconds), cost 4.622303
Unexaggerating Ps by 12.000000
Iteration 300 (50 iterations in 0.74 seconds), cost 2.721241
Iteration 350 (50 iterations in 0.81 seconds), cost 2.308060
Iteration 400 (50 iterations in 0.94 seconds), cost 2.093742
Iteration 450 (50 iterations in 0.90 seconds), cost 1.965181
Iteration 500 (50 iterations in 1.33 seconds), cost 1.875860
Iteration 550 (50 iterations in 1.68 seconds), cost 1.809087
Iteration 600 (50 iterations in 2.08 seconds), cost 1.765513
Iteration 650 (50 iterations in 2.71 seconds), cost 1.729019
Iteration 700 (50 iterations in 2.68 seconds), cost 1.708275
Iteration 750 (50 iterations in 2.96 seconds), cost 1.693048
Iteration 800 (50 iterations in 3.96 seconds), cost 1.681363
Iteration 850 (50 iterations in 4.81 seconds), cost 1.673190
Iteration 900 (50 iterations in 4.28 seconds), cost 1.660609
Iteration 950 (50 iterations in 4.76 seconds), cost 1.652336
Iteration 1000 (50 iterations in 6.39 seconds), cost 1.645526
Wrote the 5989 x 2 data matrix successfully.
Done.

OSError: [Errno 22] Invalid argument

I am getting the following error. Please help!

Traceback (most recent call last):
  File "data.py", line 27, in <module>
    X_embed = fast_tsne(X_tfidf.toarray())
  File "/Users/mayukh-imac/Documents/ML/FIt-SNE/fast_tsne.py", line 71, in fast_tsne
    f.write(X.tobytes())
OSError: [Errno 22] Invalid argument

Applicability to binary representations?

Hi, I've been playing with the code and its speed is very impressive. Most of the examples employ real-valued representations of the objects. Is the method applicable to a binary representation? In particular, I guess the similarity calculations should be appropriate for the representation (e.g., Jaccard for binary data, Euclidean for real-valued data, etc.).

So I guess the question is, is it appropriate to run this on binary data?
