[MNIST](http://yann.lecun.com/exdb/mnist) is a dataset of handwritten digit images with 60,000 training examples and 10,000 test examples. In many papers, as in this tutorial, the official 60,000 training examples are split into a training set of 50,000 examples and a validation set of 10,000 examples (used to select hyper-parameters such as the learning rate, model size, and so on). All digit images have been size-normalized and centered to 28×28 pixels, with 256 grey levels per pixel.
For convenient use in Python, the dataset has been pickled; you can [download it here](http://deeplearning.net/data/mnist/mnist.pkl.gz). The file contains a tuple of three lists: the training set, the validation set, and the test set. Each of these is itself a pair of lists: one holds the images, each represented as a numpy 1-dimensional array of 784 (28×28) float values between 0 and 1 (0 is black, 1 is white), and the other holds the 0 to 9 labels of the images. The code below shows how to load the dataset.

```Python
import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
```

When the dataset is used, it is usually split into minibatches. We encourage you to store the dataset in shared variables and to access it by minibatch index, given a fixed and known batch size. The reason behind shared variables is the GPU: copying data into GPU memory is slow, and if a minibatch is copied over every time it is needed (which is what happens without shared variables), GPU code will run no faster than CPU code. If you define your data as shared variables, you give Theano the chance to copy the entire dataset to the GPU in a single call when the shared variables are constructed. Afterwards, the GPU can access any minibatch by taking a slice of the shared variable, without copying anything from CPU memory again. Also, because the data points (real values) and the labels (integers) have different types, and because the test, validation, and training sets serve different purposes, we recommend storing them in separate shared variables (which yields six shared variables in total).
Since the data now sit in one variable, a minibatch is defined as a slice of that variable; it is most naturally specified by its index and its size. The code below shows how to store the data and how to access a minibatch.

```Python
def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats, so the
    # labels are stored as floatX as well; since the computations need the
    # labels as ints (they are used as indices), casting shared_y to 'int32'
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')
```

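Following the slicing convention described above, the shared variables can then be built for all three sets and a minibatch read back as a slice. The sketch below mirrors that convention; the `batch_size` of 500 is only an illustrative choice:

```Python
test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set as a slice
data = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
```
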
The data have to be stored on the GPU as floats (which is why the `dtype` above is set to `theano.config.floatX`); the labels are then cast back to int.
If you run your code on the GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you should still store the data in shared variables, but keep only a chunk that is small enough (several minibatches) in a shared variable at any one time and use it for training; once you are done with a chunk, update the values it stores. This minimizes data transfer between CPU and GPU memory.

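A minimal sketch of this chunking scheme, assuming a hypothetical helper `load_chunk(i)` (not part of the tutorial) that returns the i-th chunk of the data as numpy arrays, and a hypothetical chunk count `n_chunks`:

```Python
# Sketch only: load_chunk(i) is a hypothetical function returning
# (chunk_x, chunk_y) numpy arrays small enough to fit in GPU memory.
for chunk_index in xrange(n_chunks):
    chunk_x, chunk_y = load_chunk(chunk_index)
    # overwrite the contents of the shared variables in place,
    # so only this chunk lives in GPU memory at a time
    shared_x.set_value(numpy.asarray(chunk_x, dtype=theano.config.floatX))
    shared_y.set_value(numpy.asarray(chunk_y, dtype=theano.config.floatX))
    # ... then iterate over the minibatches inside this chunk as usual
```
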
Tutorial code often uses the following namespaces:

```Python
import theano
import theano.tensor as T
import numpy
```

### A Primer on Supervised Optimization for Deep Learning
##### Learning a Classifier
###### Zero-One Loss
The zero-one loss counts the number of errors the prediction function makes on the dataset, where the prediction for an example is the most probable class under the model:

$$f(x) = \operatorname{argmax}_k P(Y = k \mid x, \theta)$$

$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$$

```Python
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero-one loss; to get the actual value, this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))
```

###### Negative Log-Likelihood Loss
Since the zero-one loss is not differentiable, optimizing it directly for a large model is prohibitively expensive. Instead, we train the model by maximizing the likelihood of the correct labels given the data.
Since we conventionally speak of minimizing a loss function, we attach a minus sign to the log-likelihood and minimize the negative log-likelihood (NLL) instead.

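For reference, the log-likelihood of the data under the model parameters $\theta$, and the NLL that the code below computes, can be written as

$$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta), \qquad \mathrm{NLL}(\theta, \mathcal{D}) = -\mathcal{L}(\theta, \mathcal{D})$$
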
```Python
# NLL is a symbolic variable; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
```

##### Stochastic Gradient Descent
What is ordinary gradient descent? Gradient descent is a simple algorithm in which the negative gradient determines the search direction at each iteration, so that each step gradually decreases the objective function being optimized.
Pseudocode is shown below.

```Python
# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

Stochastic gradient descent is a refinement of ordinary gradient descent: it estimates the gradient from a subset of the examples instead of all of them, which lets it approach the result more quickly. In the code below, we estimate the gradient from just one example at a time.

```Python
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ...  # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

The variant we recommend for deep learning, mentioned more than once in these tutorials, uses so-called "minibatches". Minibatch SGD differs from plain SGD in that each gradient estimate is computed over a minibatch of examples rather than a single one. This technique reduces the variance of the gradient estimate and also makes better use of the hierarchical memory organization of modern computers.

```Python
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ...  # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```

There is a tradeoff in the choice of the minibatch size B. With a large B, time is wasted reducing the variance of the gradient estimate when it would be better spent on additional gradient steps; with a small B, more iterations are needed and the updates fluctuate more. The optimal B is therefore model-, dataset-, and hardware-dependent, and can range from 1 up to several hundred.
Full pseudocode is given below.

```Python
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
```

##### Regularization
Regularization is meant to prevent overfitting during MSGD training. To combat overfitting, we present two methods: L1/L2 regularization and early-stopping.
###### L1/L2 Regularization

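L1/L2 regularization adds a penalty on the magnitude of the parameters to the loss being minimized. Matching the form used in the code below, the regularized loss can be written as

$$E(\theta, \mathcal{D}) = \mathrm{NLL}(\theta, \mathcal{D}) + \lambda_1 \lVert \theta \rVert_1 + \lambda_2 \lVert \theta \rVert_2^2, \qquad \lVert \theta \rVert_1 = \sum_j |\theta_j|, \quad \lVert \theta \rVert_2^2 = \sum_j \theta_j^2$$

where the hyper-parameters $\lambda_1$ and $\lambda_2$ control the strength of each penalty term.
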
```Python
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
```

###### Early-Stopping
Early-stopping combats overfitting by monitoring the model's performance on a validation set: training is stopped when performance on the validation examples stops improving sufficiently.

```Python
# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2  # wait this much longer when a new best is
                       # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience / 2)
                       # go through this many
                       # minibatches before checking the network
                       # on the validation set; in this case we
                       # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ...  # compute gradient
        params -= learning_rate * d_loss_wrt_params  # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ...  # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:

                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
```