
SpaceSpanningNetwork

The problem

We are trying to maximize the range of tanh MLPs at a fixed topology. The range of a network is defined as the set of points it "typically" outputs given a gaussian input vector. We are interested in the case where the input dimension is much smaller than the output dimension.

There are many possible measures of how good a certain range is. I opted for the following formal definition of the problem. We are looking for the set of weights and biases $\tilde{p}$ that, given:

$NN_p : \mathbb{R}^{d_{in}} \rightarrow \mathbb{R}^{d_{out}}, \quad d_{in} \ll d_{out}$

$Y \sim \mathcal{N}(0,1)^{d_{out}}$

$n \in \mathbb{N}^*, \quad \forall i \in [1, n] \quad X_i \sim \mathcal{N}(0,1)^{d_{in}}, \quad (X_i)_i \ \text{i.i.d.}$

satisfies:

$\tilde{p} = \operatorname{argmin}_p \Big[ E_{Y} \Big( E_{(X_1,..,X_n)} \Big[ \min_{i\in[1,n]} \big( \| NN_p(X_i) - Y \|_2 \big) \Big] \Big) \Big]$


We will refer to the quantity inside the $\operatorname{argmin}_p$ as $SF(p)$, the Space Filling capability of the network with parameters $p$. Even though it depends on $n$, that dependence will not appear explicitly; the same value of $n$ is used in the code whenever two experiments are compared.

This expression means that for a likely $Y \in \mathbb{R}^{d_{out}}$, the network $NN_{\tilde{p}}$ frequently outputs vectors close to $Y$ given likely inputs $X \in \mathbb{R}^{d_{in}}$. The purpose of $n$ is to get good coverage of the space with the $(X_1,..,X_n)$ without covering "too much" of it, so that a vector $Y$ that the network can reach only rarely is still penalized by the measure. Taking $n = \infty$ provides good insight: in that limit, the inner minimum becomes the distance from $Y$ to the support of the output distribution.

This investigation was motivated by the fact that as the number of layers of a randomly initialized* MLP grows, the output range shrinks and quickly collapses to near 0 (i.e. $E_p[SF(p)] \approx C(d_{out}) \gg \epsilon$); a quick reproduction sketch follows the footnote. This can be somewhat mitigated by zeroing the biases, but it is not sufficient. The methods explored are described in the next section.

(*) With e.g. Xavier/Glorot initialization (in which case a decrease is to be expected), Kaiming, or plain normal initialization.
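A minimal sketch to observe the collapse, assuming PyTorch's default (Kaiming-uniform) `Linear` initialization; the widths, depths, and batch size are arbitrary illustration choices, not values from this repository:

```python
import torch
import torch.nn as nn

def make_mlp(d_in: int, d_hidden: int, d_out: int, n_layers: int) -> nn.Sequential:
    # Plain tanh MLP; nn.Linear uses PyTorch's default (Kaiming-uniform) init.
    dims = [d_in] + [d_hidden] * (n_layers - 1) + [d_out]
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.Tanh()]
    return nn.Sequential(*layers)

# The mean output norm shrinks as depth grows, illustrating the range collapse.
X = torch.randn(1024, 4)
for depth in (2, 4, 8, 16, 32):
    net = make_mlp(4, 64, 64, depth)
    with torch.no_grad():
        norms = net(X).norm(dim=1)
    print(f"depth {depth:2d}: mean output norm {norms.mean().item():.4f}")
```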

There is another, similar problem I am trying to solve: given an MLP architecture and an arbitrary probability distribution over the inputs, how to train a network with this architecture so that the output is a standard gaussian vector (mean 0, identity covariance). This is of course impossible in the general case, so we must instead minimize an objective function, something like $KL(NN \,\|\, \mathcal{N}(0,1)) + \|\Sigma_{NN}\|$. Studying predictive coding and autoencoders could be fruitful.
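The KL term cannot be computed exactly from samples; as one possible starting point (my substitution, not a method from this repository), a moment-matching proxy penalizes the first two moments of the output batch. The function name is hypothetical:

```python
import torch

def gaussianization_loss(out: torch.Tensor) -> torch.Tensor:
    # Moment-matching proxy for KL(NN || N(0, I)): penalize the deviation of
    # the batch mean from 0 and of the batch covariance from the identity.
    mu = out.mean(dim=0)
    centered = out - mu
    cov = centered.T @ centered / (out.shape[0] - 1)
    eye = torch.eye(out.shape[1], device=out.device)
    return mu.pow(2).sum() + (cov - eye).pow(2).sum()
```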

Results

Two machine learning techniques are used: gradient descent on a single network, and an evolutionary algorithm on a population. Both require an estimate of $SF(p)$, which is obtained as follows, using the code's terminology (a code sketch follows below):

  • A score is initialized at 0.
  • N_YS independent gaussian vectors are generated in the output space.
  • For each one of these vectors Y, N_XS independent gaussian input vectors X are generated. The minimum of the distances between Y and the NN(X)s is subtracted from the score.
  • Once all the Ys have been processed, the approximation is obtained by dividing the score by N_YS.

The consistency (and probably the accuracy, as the estimator appears to be unbiased; TODO: check) of this estimator is evaluated at each run by computing its variance over several measurements. Note that the expectation over the $X_i$, $E_{(X_1,..,X_n)}$, is estimated with only one sample.

Since the MLP's output activation function is tanh, its maximal range is limited to $[-1, 1]^{d_{out}}$. To compensate, the components of the gaussian vectors sampled in the output space (the Ys) are scaled by 0.3 and then clamped to $[-1, 1]$.
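A minimal PyTorch sketch of this estimator, following the steps above; the function name `estimate_SF` and the default values of `N_YS` and `N_XS` are illustrative choices, not taken from the code:

```python
import torch

def estimate_SF(net, d_in: int, d_out: int, N_YS: int = 256, N_XS: int = 1024) -> float:
    # Monte-Carlo estimate following the steps above. The running score
    # accumulates minus the minimum distance, so the returned value
    # estimates -SF(p): higher is better.
    score = 0.0
    with torch.no_grad():
        for _ in range(N_YS):
            # Target sampled in output space, scaled by 0.3 and clamped to
            # [-1, 1] to account for the tanh output activation.
            Y = (0.3 * torch.randn(d_out)).clamp(-1.0, 1.0)
            X = torch.randn(N_XS, d_in)
            dists = (net(X) - Y).norm(dim=1)  # ||NN_p(X_i) - Y||_2 for each i
            score -= dists.min().item()
    return score / N_YS
```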

A- Genetic algorithm

  • Technicalities

Specimens are simply parameter sets. Mutations are sparse, gaussian and self-regulating; combinations are sparse as well, using the phylogenetic tracking from RECURSIVE_NODES. Selection is the two-factor ranking technique used in RECURSIVE_NODES.

A fixed dataset of $N$ pairs $(X,Y)$, generated i.i.d. from a normal distribution, is used for the entirety of the evolution process. The fitness of a specimen $p$ is:

$f_p = -\frac{1}{N}\sum_{(X,Y) \in \text{dataset}}{\|NN_p(X) - Y\|_2}$
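A sketch of this fitness under the same assumptions as the previous snippets; the dataset dimensions are placeholders:

```python
import torch

# Fixed dataset of N i.i.d. gaussian pairs, reused for the whole evolution.
N, d_in, d_out = 1000, 4, 64
X = torch.randn(N, d_in)
Y = torch.randn(N, d_out)

def fitness(net) -> float:
    # f_p = -(1/N) * sum over the dataset of ||NN_p(X) - Y||_2.
    with torch.no_grad():
        return -(net(X) - Y).norm(dim=1).mean().item()
```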

  • Observations

As the average fitness over the population increases, we observe a decrease of $SF(p^*)$, where $p^*$ is the fittest specimen at that step (and probably a decrease of the average $SF$ as well, but it is expensive to compute accurately).

B- Gradient descent

We generate a fixed dataset of $N$ pairs $(X,Y)$ of i.i.d. gaussian vectors, and apply gradient descent (full-batch and online yield the same results) to a randomly initialized network. As with the evolutionary process, we observe that as the loss on the dataset decreases, so does $SF$.
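A minimal full-batch training loop under the same assumptions, reusing `make_mlp`, the dataset `(X, Y)`, and `estimate_SF` from the sketches above; the learning rate, depth, and step count are placeholders:

```python
import torch

# Full-batch gradient descent on the fixed (X, Y) dataset above; the loss is
# the mean L2 distance, i.e. the negative of the genetic algorithm's fitness.
net = make_mlp(d_in=4, d_hidden=64, d_out=64, n_layers=8)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
for step in range(10_000):
    loss = (net(X) - Y).norm(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # SF can be tracked alongside the training loss.
        print(step, loss.item(), estimate_SF(net, d_in=4, d_out=64))
```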

C- Comparison

Zeroing the biases at initialization kickstarts the convergence for both methods but does not improve the end result.

The final $SF$ values are similar for GD and GA, but the GA is much slower. Performance on the proxy task was much better with GD, but that is irrelevant here.

These results are satisfactory, but in the absence of theoretical studies of $SF$ I do not know whether this is close to the lower bound on $SF$ for a given set of hyperparameters and this inference structure (tanh MLP).
