We are trying to maximize the range of tanh MLPs at fixed topology. The range of a network is defined as the set of points it "typically" outputs given a gaussian input vector. We are interested in the case where the input dimension is much smaller than the output dimension.
There are many possible measures of how good a given range is. I opted for the following formal definition of the problem. Let

$n \in \mathbb{N}^*, \quad X_1, \dots, X_n \quad i.i.d., \quad X_i \hookrightarrow \mathcal{N}(0,1)^{d_{in}}$

We are looking for the set of weights and biases $\tilde{p}$ that satisfies:

$\tilde{p} \;=\; \operatorname{argmin}_p \Big[ \, E_{Y}\Big( E_{(X_1,\dots,X_n)}\Big[ \min_{i\in[1,n]} \big\|NN_p(X_i) - Y\big\|_2 \Big] \Big) \Big]$
We will refer to the quantity inside the $\operatorname{argmin}$ as $SF(p)$. This expression means that, for a likely target $Y$, at least one of the $n$ gaussian inputs should be mapped by $NN_p$ close to $Y$: the larger the network's typical output set, the smaller this expected minimum distance.
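To make the objects concrete, here is a minimal NumPy sketch of $NN_p$; the topology `[d_in, 64, 64, d_out]` is an arbitrary placeholder, and the Glorot-style initialization is just one of the schemes mentioned in the footnote below.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Glorot-style initialization of a tanh MLP: the parameter set p."""
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))
        params.append((W, np.zeros(fan_out)))
    return params

def nn_forward(params, X):
    """NN_p(X) for a batch X of shape (k, d_in); tanh after every layer,
    the output layer included."""
    for W, b in params:
        X = np.tanh(X @ W + b)
    return X

rng = np.random.default_rng(0)
d_in, d_out = 3, 100                        # placeholder dimensions
params = init_mlp([d_in, 64, 64, d_out], rng)
X = rng.normal(size=(10, d_in))             # 10 gaussian input vectors
print(nn_forward(params, X).shape)          # (10, 100)
```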
This investigation was motivated by the fact that as the number of layers of a randomly initialized* MLP grows, the output range shrinks, and quickly collapses to near 0 (i.e. the network maps almost all inputs to a small neighborhood of a single point).
(*) Like Xavier/Glorot (in which case a decrease is to be expected), Kaiming, or normal initialization.
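A quick self-contained experiment illustrating the collapse (all sizes are arbitrary placeholders): with Glorot initialization and zero biases, the spread of the outputs of a batch of gaussian inputs shrinks as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, width, d_out = 3, 64, 100             # placeholder topology

for depth in (1, 4, 16, 64):
    sizes = [d_in] + [width] * depth + [d_out]
    Ws = [rng.normal(0.0, np.sqrt(2.0 / (a + b)), size=(a, b))
          for a, b in zip(sizes[:-1], sizes[1:])]
    H = rng.normal(size=(1000, d_in))       # 1000 gaussian input vectors
    for W in Ws:
        H = np.tanh(H @ W)                  # Glorot weights, zero biases
    print(f"depth {depth:3d}: output std {H.std():.4f}")
```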
There is another, similar problem I am trying to solve: given an MLP architecture and an arbitrary probability distribution over the inputs, how to train a network with this architecture so that the output is a gaussian vector with mean 0 and identity covariance matrix. This is of course impossible in the general case, so we must instead minimize an objective function, something like a statistical distance between the output distribution and $\mathcal{N}(0, I_{d_{out}})$.
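As an illustration only (the actual objective is left open above), a simple candidate is a moment-matching penalty; `gaussianization_loss` is a name introduced here, and matching the first two moments does not, of course, certify gaussianity.

```python
import numpy as np

def gaussianization_loss(Y):
    """Moment-matching penalty for a batch Y of shape (n, d): squared
    deviation of the empirical mean from 0 and of the empirical
    covariance from the identity. A stand-in, not the actual objective."""
    mu = Y.mean(axis=0)
    C = np.cov(Y, rowvar=False)
    return float(np.sum(mu ** 2) + np.sum((C - np.eye(Y.shape[1])) ** 2))
```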
Two machine learning techniques are used: gradient descent on a single network, and evolutionary algorithms on a population. An estimation of $SF(p)$ is computed as follows (a code sketch follows the list):
- A score is initialized at 0.
- N_YS independent gaussian vectors are generated.
- For each one of those vectors Y, N_XS independent gaussian vectors X are generated. The minimum of the distances between Y and the NN(X)s is subtracted from the score.
- Once all the Ys have been processed, the approximation is obtained by dividing the score by N_YS (the score is thus the opposite of the estimated $SF$, so higher is better).
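A direct transcription of the steps above, assuming a batched forward pass like `nn_forward` from the first sketch; `estimate_score` and its argument names are introduced here, and `N_XS` plays the role of $n$.

```python
import numpy as np

def estimate_score(forward, d_in, d_out, N_YS, N_XS, rng):
    """Monte-Carlo estimate of the (negated) SF(p). `forward` maps an
    (N_XS, d_in) batch to an (N_XS, d_out) batch."""
    score = 0.0                                    # score initialized at 0
    for _ in range(N_YS):
        Y = rng.normal(size=d_out)                 # one gaussian target vector
        X = rng.normal(size=(N_XS, d_in))          # N_XS gaussian inputs
        dists = np.linalg.norm(forward(X) - Y, axis=1)
        score -= dists.min()                       # subtract the minimum distance
    return score / N_YS                            # divide by N_YS at the end
```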
This estimator is a sample mean of i.i.d. draws of $\min_{i\in[1,n]} \|NN_p(X_i) - Y\|_2$, so it is unbiased, and consistent by the law of large numbers. Its precision is evaluated at each run by computing its variance over several measurements. Note that the expectation over the $(X_i)$ is approximated with a single draw of $(X_1, \dots, X_n)$ per $Y$.
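The variance check then amounts to repeating the measurement; a usage sketch reusing `estimate_score` from above, with a stand-in one-layer network and arbitrary sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(3, 100))        # stand-in one-layer network
forward = lambda X: np.tanh(X @ W)
runs = [estimate_score(forward, 3, 100, N_YS=200, N_XS=50, rng=rng)
        for _ in range(10)]
print(np.mean(runs), np.std(runs))             # small std => stable estimate
```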
Since the MLP's output activation function is tanh, its maximal range is limited to the open hypercube $(-1, 1)^{d_{out}}$.
- Technicalities
Specimens are simply parameter sets. Variation uses sparse gaussian auto-regulated mutations and sparse combinations, with RECURSIVE_NODES phylogenetic tracking. Selection is the two-factor ranking technique used in RECURSIVE_NODES.
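For reference, a generic sketch of what sparse gaussian mutation and sparse combination can look like on a flat parameter vector; the auto-regulation and the RECURSIVE_NODES-specific machinery are not reproduced, and both function names are placeholders.

```python
import numpy as np

def sparse_gaussian_mutation(p, rate, sigma, rng):
    """Perturb a random sparse subset (~rate of the entries) of a flat
    parameter vector p. Generic stand-in; RECURSIVE_NODES' auto-regulated
    scheme is not reproduced here."""
    mask = rng.random(p.shape) < rate
    return p + mask * rng.normal(0.0, sigma, size=p.shape)

def sparse_combination(p1, p2, rate, rng):
    """Copy a sparse subset of p2's entries into p1 (illustrative only)."""
    mask = rng.random(p1.shape) < rate
    return np.where(mask, p2, p1)
```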
A fixed dataset of target vectors $Y$ is used to evaluate all specimens at a given generation, so that fitnesses are comparable for ranking.
- Observations
As the average fitness over the population increases, we observe a decrease of $SF(p^*)$, $p^*$ being the fittest specimen at this step (and probably a decrease of the average $SF$ over the population as well).
We generate a fixed dataset of targets $Y$ shared by both methods, so that their scores are directly comparable.
Zeroing the biases at initialization kickstarts the convergence for both methods but does not improve the end result.
The final
Those results are satisfactory, but in the absence of theoretical studies of this problem it is hard to tell how close to optimal they are.