From 6cc2fb73c20974c6a06cba5b8814ac1a7ce0ff62 Mon Sep 17 00:00:00 2001
From: Xavier Glorot
Date: Wed, 31 Mar 2010 13:46:39 -0400
Subject: [PATCH 1/2] test

---
 doc/mlp.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/mlp.txt b/doc/mlp.txt
index f3ddae66..6f6bf16c 100644
--- a/doc/mlp.txt
+++ b/doc/mlp.txt
@@ -402,7 +402,7 @@ properties.
 Weight initialization
 ---------------------
-The rationale for initializing the weights by sampling from
+The rational for initializing the weights by sampling from
 :math:`uniform[-\frac{1}{\sqrt{fan_{in}}},\frac{1}{\sqrt{fan_{in}}}]` is to
 make learning faster at the beginning on training. By initializing with
 small random values around the origin, we make sure that the sigmoid
 operates in its linear regime, where gradient updates are largest.

From 938aa81efc3df1a4dbf5246f5853363c13c0777c Mon Sep 17 00:00:00 2001
From: Xavier Glorot
Date: Wed, 31 Mar 2010 15:49:09 -0400
Subject: [PATCH 2/2] Change in weights initialization description for mlp.txt

---
 doc/mlp.txt | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/doc/mlp.txt b/doc/mlp.txt
index 6f6bf16c..fd27c104 100644
--- a/doc/mlp.txt
+++ b/doc/mlp.txt
@@ -402,16 +402,17 @@ properties.
 Weight initialization
 ---------------------
-The rational for initializing the weights by sampling from
-:math:`uniform[-\frac{1}{\sqrt{fan_{in}}},\frac{1}{\sqrt{fan_{in}}}]` is to
-make learning faster at the beginning on training. By initializing with
-small random values around the origin, we make sure that the sigmoid
-operates in its linear regime, where gradient updates are largest.
-
-On their own, weights cannot assure that this holds true. As explained in
-explained in `Section 4.6 `_,
-this requires coordination between normalization of inputs (to zero-mean and
-standard deviation of 1) and a proper choice of the sigmoid.
+At initialization we want the weights to be small enough around the origin
+so that the activation function operates in its linear regime, where gradients are
+the largest. Other desirable properties, especially for deep networks,
+are to conserve the variance of the activations as well as the variance of back-propagated gradients from layer to layer.
+This allows information to flow well upward and downward in the network and
+reduces discrepancies between layers.
+Under some assumptions, a compromise between these two constraints leads to the following
+initialization: :math:`uniform[-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}},\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}]`
+for tanh and :math:`uniform[-\frac{4\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}},\frac{4\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}]`
+for sigmoid, where :math:`fan_{in}` is the number of inputs and :math:`fan_{out}` the number of hidden units.
+For mathematical considerations, please refer to [Xavier10].

 Learning rate
 --------------
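To make the sampled range concrete, here is a minimal NumPy sketch of the initialization described in the second patch. The helper name ``init_weights``, the use of NumPy rather than the tutorial's Theano code, and the layer sizes in the usage line are illustrative assumptions, not part of the patch::

    import numpy as np

    def init_weights(rng, fan_in, fan_out, activation="tanh"):
        # Illustrative helper (not from the tutorial): sample a (fan_in, fan_out)
        # weight matrix from uniform[-sqrt(6)/sqrt(fan_in + fan_out),
        #                             sqrt(6)/sqrt(fan_in + fan_out)].
        bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
        if activation == "sigmoid":
            bound *= 4.0  # sigmoid layers use 4 times the tanh bound
        return rng.uniform(low=-bound, high=bound, size=(fan_in, fan_out))

    # Example: a tanh hidden layer with 784 inputs and 500 hidden units (assumed sizes).
    rng = np.random.RandomState(1234)
    W = init_weights(rng, fan_in=784, fan_out=500, activation="tanh")

Only the sampling of the initial values is shown here; in the tutorial code the resulting array would typically be wrapped in a Theano shared variable.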