
Commit 159e26e

update getting started
1 parent 2da0afb commit 159e26e


1_Getting_Started_入门.md

Lines changed: 174 additions & 0 deletions
@@ -14,15 +14,18 @@
[MNIST](http://yann.lecun.com/exdb/mnist) is a dataset of handwritten digit images with 60000 training examples and 10000 test examples. In many papers, as in this tutorial, the 60000 training examples are split into a training set of 50000 examples and a validation set of 10000 examples (used for selecting hyperparameters such as the learning rate and the model size). All digit images have been size-normalized and centered in a fixed 28*28-pixel image, with 256 grey levels per pixel.

For convenience in Python, the dataset has been pickled; you can [download](http://deeplearning.net/data/mnist/mnist.pkl.gz) it here. The file contains a tuple of three lists: the training set, the validation set and the test set. Each of these lists is itself a pair of lists: one is a numpy 1-dimensional array of 784 (28*28) float values between 0 and 1 (0 is black, 1 is white) per image, the other is a list of image labels between 0 and 9. The following code shows how to load the dataset.
```Python
import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
```
When using the dataset, we usually split it into minibatches. We encourage you to store the dataset in shared variables and to access it by minibatch index (given a fixed, known batch size). The reason behind shared variables is related to using the GPU: copying data into GPU memory is a large overhead, so without shared variables GPU code would not run much faster than CPU code. If you define your data as shared variables, you give Theano the possibility to copy the entire data onto the GPU in a single call when the shared variables are constructed. Afterwards the GPU can access any minibatch by taking a slice of the shared variables, without copying data from the CPU again. Also, because the data points (real values) and the labels (integers) are of different natures, and the test, validation and training sets serve different purposes, we recommend storing them in separate shared variables (which yields six different shared variables).

Since the data is now in one variable, a minibatch is defined as a slice of that variable; it is more natural to define a minibatch by its index and its size. The code below shows how to store the data and how to access a minibatch.
```Python
def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables
@@ -45,6 +48,7 @@ def shared_dataset(data_xy):
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')
```
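As a concrete illustration of the minibatch access described above, here is a minimal sketch of slicing the shared variables by minibatch index. It assumes that shared variables named `train_set_x` and `train_set_y` have been built with `shared_dataset`, and the `batch_size` value is only an example:

```Python
batch_size = 500    # hypothetical minibatch size

# access the third minibatch of the training set as a (symbolic) slice
# of the shared variables, without copying data back from the GPU
data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
```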
The data is stored on the GPU as floats (the `dtype` is given by `theano.config.floatX`); the labels are then cast back to int.

If you are running your code on the GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you can still store a sufficiently small chunk of the data (several minibatches) in a shared variable and use that during training. Once you have gone through a chunk, update the values it stores. This minimizes the amount of data transferred between CPU and GPU memory.
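A minimal sketch of that chunking scheme is shown below; the chunk size, the `load_chunk` helper, `n_train_examples` and the variable names are all hypothetical and only illustrate reusing one shared variable for successive chunks:

```Python
import numpy
import theano

chunk_size = 10 * 500   # e.g. ten minibatches of 500 examples (hypothetical value)

# allocate the shared variables once, sized for a single chunk;
# labels are stored as floats for the same GPU reason as in shared_dataset above
chunk_x = theano.shared(numpy.zeros((chunk_size, 784), dtype=theano.config.floatX))
chunk_y = theano.shared(numpy.zeros((chunk_size,), dtype=theano.config.floatX))

for start in xrange(0, n_train_examples, chunk_size):
    # load_chunk is a hypothetical helper returning the next slice of the
    # dataset as numpy arrays (data_x: image matrix, data_y: label vector)
    data_x, data_y = load_chunk(start, chunk_size)
    # overwrite the values stored on the GPU instead of allocating new
    # shared variables, so only this chunk is transferred
    chunk_x.set_value(data_x, borrow=True)
    chunk_y.set_value(data_y, borrow=True)
    # ... train on the minibatches contained in this chunk ...
```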

@@ -64,6 +68,176 @@ import numpy
```

### A Primer on Supervised Optimization for Deep Learning
##### Learning a Classifier
###### Zero-One Loss

f(x) = argmax_k P(Y = k | x, theta)
L_{0,1} = sum_i I(f(x_i) != y_i)
```Python
# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y))
```
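As a small sketch of what that compilation step could look like, the snippet below declares the symbolic inputs explicitly (the snippet above leaves `p_y_given_x` and `y` to the surrounding model code) and takes the argmax per example with `axis=1`, which is an assumption made here for a batch of inputs:

```Python
import numpy
import theano
import theano.tensor as T

# hypothetical symbolic inputs: per-class probabilities and integer labels
p_y_given_x = T.matrix('p_y_given_x')
y = T.ivector('y')

zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))

# compile the symbolic expression into a callable function
count_errors = theano.function([p_y_given_x, y], zero_one_loss)

probs = numpy.array([[0.1, 0.9], [0.8, 0.2]], dtype=theano.config.floatX)
labels = numpy.array([1, 1], dtype='int32')
print(count_errors(probs, labels))   # 1: the second example is misclassified
```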
###### Negative Log-Likelihood Loss

Since the zero-one loss is not differentiable, optimizing it for large models is prohibitively expensive. We therefore train the model by maximizing the likelihood of the correct labels given the data.

Since we usually speak of minimizing a loss function, we simply add a minus sign to the log-likelihood and minimize the negative log-likelihood loss instead.
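In the same informal notation as above, the negative log-likelihood that the code below computes is:

NLL(theta, D) = - sum_i log P(Y = y_i | x_i, theta)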
```Python
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
```
##### Stochastic Gradient Descent

What is ordinary gradient descent? Gradient descent is a simple algorithm in which the negative gradient determines the search direction at each iteration, so that every step makes the objective being optimized a little smaller.

The pseudocode is shown below.
```Python
# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```
Stochastic gradient descent is a refinement of ordinary gradient descent: the gradient is estimated from a subset of the examples instead of all of them, which lets us approach the optimum much faster. In the code below we estimate the gradient from only one example at a time.
```Python
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```
The variant we have mentioned more than once for deep learning is "minibatches". Minibatch stochastic gradient descent differs from plain SGD in that each gradient estimate uses a minibatch of examples rather than a single one. This technique reduces the variance of each gradient estimate and makes good use of the hierarchical memory organization of modern computers.
```Python
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
```
There is a tradeoff in the choice of the minibatch size B. With a large B, time is wasted driving down the variance of the gradient estimate when it would be better spent on additional gradient steps; with a small B, more updates are needed and they fluctuate more. The best size therefore depends on the model, the dataset and the hardware, and can be anywhere from 1 to a few hundred.

In Theano, minibatch SGD looks as follows.
```Python
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
```
##### Regularization

Regularization is meant to prevent overfitting during MSGD training. To counter overfitting we present two techniques here: L1/L2 regularization and early-stopping.

###### L1/L2 Regularization
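In the informal notation used above, L1/L2 regularization adds a penalty on the magnitude of the parameters, weighted by the hyperparameters lambda_1 and lambda_2, so the regularized loss has the form

loss = NLL(theta, D) + lambda_1 * sum_j |theta_j| + lambda_2 * sum_j theta_j^2

which is what the code below builds symbolically.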
```Python
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the regularized loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
```
###### Early-Stopping

Early-stopping combats overfitting by monitoring the model's performance on a validation set and stopping training once that performance stops improving sufficiently.
```Python
# early-stopping parameters
patience = 5000                # look at this many examples regardless
patience_increase = 2          # wait this much longer when a new best is
                               # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2)
                               # go through this many
                               # minibatches before checking the network
                               # on the validation set; in this case we
                               # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
```