Denoising Autoencoders
======================

This section assumes the reader has already read [Classifying MNIST digits using Logistic Regression](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md) and [Multilayer Perceptron](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/3_Multilayer_Perceptron_多层感知机.md). If you intend to run the code on a GPU, you should also read [GPU](http://deeplearning.net/software/theano/tutorial/using_gpu.html).

All the code for this section can be downloaded [here](http://deeplearning.net/tutorial/code/dA.py).

The denoising autoencoder (dA) is an extension of the classical autoencoder. It was introduced in [Vincent08](http://deeplearning.net/tutorial/references.html#vincent08) as a building block for deep networks. We start this tutorial with a short discussion of [autoencoders](http://deeplearning.net/tutorial/dA.html#autoencoders).

### Autoencoders
See section 4.6 of [Bengio09](http://deeplearning.net/tutorial/references.html#bengio09) for an overview of autoencoders. An autoencoder takes an input vector x in [0,1]^d and first maps it (with an encoder) to a hidden representation y in [0,1]^{d'} through a deterministic mapping

y = s(Wx + b)

where s is a non-linearity such as the sigmoid. The latent representation y, or code, is then mapped back (with a decoder) into a reconstruction z of the same shape as x, through a similar transformation

z = s(W'y + b')

(here the prime symbol does not denote matrix transposition). Given the code y, z should be seen as a prediction of x. Optionally, the weight matrix W' of the reverse mapping may be constrained to be the transpose of the forward mapping, W' = W^T; this is referred to as tied weights. The parameters of the model (W, b, b' and, if tied weights are not used, also W') are trained by minimizing the average reconstruction error.

The reconstruction error can be measured in many ways, depending on the distributional assumptions made about the input. The traditional squared error L(x, z) = ||x - z||^2 can be used. If the input is interpreted as a vector of bits or of bit probabilities, the reconstruction cross-entropy can be used:

L_H(x, z) = - \sum_{k=1}^{d} [ x_k \log z_k + (1 - x_k) \log(1 - z_k) ]

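To make the formulas above concrete, the following NumPy sketch (not part of the tutorial code; all sizes and values are toy choices) encodes a single input with randomly initialized, tied weights and evaluates the reconstruction cross-entropy.

```Python
import numpy

rng = numpy.random.RandomState(0)

d, d_prime = 6, 3                        # toy visible / hidden dimensions
x = rng.uniform(size=d)                  # an input vector in [0, 1]^d

W = rng.uniform(-0.1, 0.1, size=(d, d_prime))   # encoder weights
b = numpy.zeros(d_prime)                 # hidden bias
b_prime = numpy.zeros(d)                 # visible bias
W_prime = W.T                            # tied weights: W' = W^T


def sigmoid(a):
    return 1.0 / (1.0 + numpy.exp(-a))


y = sigmoid(numpy.dot(x, W) + b)               # code:           y = s(Wx + b)
z = sigmoid(numpy.dot(y, W_prime) + b_prime)   # reconstruction: z = s(W'y + b')

# reconstruction cross-entropy L_H(x, z)
L = -numpy.sum(x * numpy.log(z) + (1 - x) * numpy.log(1 - z))
print(L)
```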
The hope is that the code y is a distributed representation that captures the coordinates along the main factors of variation in the data. This is similar to the way the projection on principal components captures the main factors of variation in the data. Indeed, if there is one linear hidden layer (the code) and the mean squared error criterion is used to train the network, then the k hidden units learn to project the input onto the span of the first k principal components of the data. If the hidden layer is non-linear, the autoencoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution. The departure from PCA becomes more important when we consider stacking multiple encoders (as in [Hinton06](http://deeplearning.net/tutorial/references.html#hinton06), where they are used to build a deep autoencoder).

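For reference, the span of the first k principal components mentioned above can be computed directly with an SVD. The snippet below only illustrates that baseline on toy data (the data, k and variable names are arbitrary); it is not part of the tutorial code.

```Python
import numpy

rng = numpy.random.RandomState(0)
X = rng.uniform(size=(200, 10))      # 200 toy examples with 10 features
Xc = X - X.mean(axis=0)              # PCA is computed on centered data

k = 3
# rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = numpy.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                  # the first k principal components
codes = numpy.dot(Xc, components.T)  # the projection a k-unit linear code would span
```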
Because y is viewed as a lossy compression of x, it cannot be a good (small-loss) compression for all x. Optimization makes it a good compression for the training examples, and hopefully for other inputs as well, but not for arbitrary inputs. That is the sense in which an autoencoder generalizes: it gives low reconstruction error on test examples drawn from the same distribution as the training examples, but generally high reconstruction error on samples randomly chosen from the input space.

We want to implement the autoencoder in Theano as a class, so that it can later be used to build a stacked autoencoder. The first step is to create shared variables for the parameters of the autoencoder (W, b and b').
```Python
# imports needed by the dA class
import numpy

import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams


class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al., 2008
    for more details. If x is the input then equation (1) computes a partially
    destroyed version of x by means of a stochastic mapping q_D. Equation (2)
    computes the projection of the input into the latent space. Equation (3)
    computes the reconstruction of the input, while equation (4) computes the
    reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)

        y = s(W \tilde{x} + b)                                           (2)

        z = s(W' y + b')                                                 (3)

        L(x, z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log(1-z_k)]      (4)

    """

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input), the number of hidden units (the dimension
        d' of the latent or hidden space) and the corruption level. The
        constructor also receives symbolic variables for the input, weights and
        bias. Such symbolic variables are useful when, for example, the input
        is the result of some computations, or when weights are shared between
        the dA and an MLP layer. When dealing with SdAs this always happens:
        the dA on layer 2 gets as input the output of the dA on layer 1,
        and the weights of the dA are used in the second stage of training
        to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to generate
                          weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of bias values (for
                     hidden units) that should be shared between dA and
                     another architecture; if dA should be standalone set
                     this to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of bias values (for
                     visible units) that should be shared between dA and
                     another architecture; if dA should be standalone set
                     this to None

        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden

        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W`, which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype
            # theano.config.floatX so that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)

        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )

        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )

        self.W = W
        # b corresponds to the bias of the hidden units
        self.b = bhid
        # b_prime corresponds to the bias of the visible units
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input

        self.params = [self.W, self.b, self.b_prime]
```
Note that we pass `input` to the autoencoder as a parameter. This lets us chain autoencoders to build a deep network: the symbolic output y of layer k becomes the input of layer k+1 (a short sketch of this chaining is given after the next code block).

Now we can express the computation of the latent representation and of the reconstructed signal:

```Python
    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer

        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
```
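As mentioned above, passing a symbolic `input` makes chaining straightforward: the code of one dA can be fed directly into the constructor of the next. The following is only an illustrative sketch (the second layer's size is an arbitrary choice, and in the stacked-dA tutorial the layers additionally share weights with an MLP), not part of the tutorial code.

```Python
# Illustrative sketch only: chain two dA instances by feeding the first
# layer's code into the second layer's constructor. Sizes are toy values.
rng = numpy.random.RandomState(123)
theano_rng = RandomStreams(rng.randint(2 ** 30))

x = T.matrix('x')

da1 = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
         n_visible=28 * 28, n_hidden=500)
da2 = dA(numpy_rng=rng, theano_rng=theano_rng,
         input=da1.get_hidden_values(x),
         n_visible=500, n_hidden=250)
```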
Using these functions we can compute the cost and the updates for one step of stochastic gradient descent:

```Python
    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """

        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)

        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]

        return (cost, updates)
```
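`get_cost_updates` above calls a `get_corrupted_input` method that is not shown in this excerpt. In the tutorial's `dA.py` the corruption is masking noise: a fraction `corruption_level` of the input entries is zeroed out using the Theano random stream. A sketch along those lines (treat the exact body as an assumption based on that file):

```Python
    def get_corrupted_input(self, input, corruption_level):
        """Zero out a `corruption_level` fraction of the input entries.

        The binomial mask keeps each entry with probability
        1 - corruption_level; multiplying by the mask implements the
        masking noise used by the denoising autoencoder.
        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input
```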
We can now define a function that, applied iteratively, updates the parameters W, b and b_prime so that the reconstruction cost is approximately minimized:

```Python
    da = dA(
        numpy_rng=rng,
        theano_rng=theano_rng,
        input=x,
        n_visible=28 * 28,
        n_hidden=500
    )

    cost, updates = da.get_cost_updates(
        corruption_level=0.,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )
```
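With `train_da` compiled, the optimization itself is just a loop over minibatches. A minimal sketch, assuming `training_epochs`, `n_train_batches` and the other variables are set up as in the earlier logistic regression / MLP tutorials:

```Python
    # Sketch of the training loop; `training_epochs` and `n_train_batches`
    # are assumed to be defined as in the previous tutorials.
    for epoch in range(training_epochs):
        c = []  # per-minibatch reconstruction costs for this epoch
        for batch_index in range(n_train_batches):
            c.append(train_da(batch_index))
        print('Training epoch %d, cost %f' % (epoch, numpy.mean(c)))
```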
If there is no constraint besides minimizing the reconstruction error, one might expect an autoencoder with n inputs and an encoding of dimension n (or greater) to simply learn the identity function, merely mapping an input to its copy.