From 6cc2fb73c20974c6a06cba5b8814ac1a7ce0ff62 Mon Sep 17 00:00:00 2001
From: Xavier Glorot
Date: Wed, 31 Mar 2010 13:46:39 -0400
Subject: [PATCH 1/2] test

---
 doc/mlp.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/mlp.txt b/doc/mlp.txt
index f3ddae66..6f6bf16c 100644
--- a/doc/mlp.txt
+++ b/doc/mlp.txt
@@ -402,7 +402,7 @@ properties.
 Weight initialization
 ---------------------
-The rationale for initializing the weights by sampling from
+The rational for initializing the weights by sampling from
 :math:`uniform[-\frac{1}{\sqrt{fan_{in}}},\frac{1}{\sqrt{fan_{in}}}]` is to
 make learning faster at the beginning on training. By initializing with
 small random values around the origin, we make sure that the sigmoid
 operates in its linear regime, where gradient updates are largest.

From 938aa81efc3df1a4dbf5246f5853363c13c0777c Mon Sep 17 00:00:00 2001
From: Xavier Glorot
Date: Wed, 31 Mar 2010 15:49:09 -0400
Subject: [PATCH 2/2] Change in weights initialization description for mlp.txt

---
 doc/mlp.txt | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/doc/mlp.txt b/doc/mlp.txt
index 6f6bf16c..fd27c104 100644
--- a/doc/mlp.txt
+++ b/doc/mlp.txt
@@ -402,16 +402,17 @@ properties.
 Weight initialization
 ---------------------
-The rational for initializing the weights by sampling from
-:math:`uniform[-\frac{1}{\sqrt{fan_{in}}},\frac{1}{\sqrt{fan_{in}}}]` is to
-make learning faster at the beginning on training. By initializing with
-small random values around the origin, we make sure that the sigmoid
-operates in its linear regime, where gradient updates are largest.
-
-On their own, weights cannot assure that this holds true. As explained in
-explained in `Section 4.6 `_,
-this requires coordination between normalization of inputs (to zero-mean and
-standard deviation of 1) and a proper choice of the sigmoid.
+At initialization we want the weights to be small enough around the origin
+so that the activation function operates in its linear regime, where gradients are
+the largest. Other desirable properties, especially for deep networks,
+are to conserve the variance of the activations as well as the variance of back-propagated gradients from layer to layer.
+This allows information to flow well upward and downward in the network and
+reduces discrepancies between layers.
+Under some assumptions, a compromise between these two constraints leads to the following
+initialization: :math:`uniform[-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}},\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}]`
+for tanh and :math:`uniform[-\frac{4\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}},\frac{4\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}]`
+for sigmoid, where :math:`fan_{in}` is the number of inputs and :math:`fan_{out}` the number of hidden units.
+For mathematical considerations, please refer to [Xavier10].

 Learning rate
 --------------
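To make the sampled range concrete, here is a minimal NumPy sketch of the initialization described in the second patch. The helper name ``init_weights``, the use of NumPy rather than the tutorial's Theano code, and the layer sizes in the usage line are illustrative assumptions, not part of the patch::

    import numpy as np

    def init_weights(rng, fan_in, fan_out, activation="tanh"):
        # Illustrative helper (not from the tutorial): sample a (fan_in, fan_out)
        # weight matrix from uniform[-sqrt(6)/sqrt(fan_in + fan_out),
        #                             sqrt(6)/sqrt(fan_in + fan_out)].
        bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
        if activation == "sigmoid":
            bound *= 4.0  # sigmoid layers use 4 times the tanh bound
        return rng.uniform(low=-bound, high=bound, size=(fan_in, fan_out))

    # Example: a tanh hidden layer with 784 inputs and 500 hidden units (assumed sizes).
    rng = np.random.RandomState(1234)
    W = init_weights(rng, fan_in=784, fan_out=500, activation="tanh")

Only the sampling of the initial values is shown here; in the tutorial code the resulting array would typically be wrapped in a Theano shared variable.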