
Commit 2da0afb

committed
update read
1 parent 8e429c4 commit 2da0afb

File tree

2 files changed: +29 -31 lines changed


1_Getting_Started_入门.md

Lines changed: 29 additions & 30 deletions
@@ -14,38 +14,37 @@

[MNIST](http://yann.lecun.com/exdb/mnist) is a dataset of handwritten digit images containing 60,000 training examples and 10,000 test examples. Many papers, this tutorial included, split the 60,000 training examples into an actual training set of 50,000 examples and a validation set of 10,000 examples (for selecting hyperparameters such as the learning rate, model size, and so on). All digit images have been size-normalized and centered in 28*28-pixel images with 256 grey levels.

To make the dataset easier to use in Python, we have preprocessed it; you can [download](http://deeplearning.net/data/mnist/mnist.pkl.gz) it here. The pickled file represents a tuple of three lists: the training set, the validation set, and the test set. Each of the three is itself a pair of lists: one holds the images, each represented as a numpy 1-dimensional array of 784 (28*28) float values between 0 and 1 (0 is black, 1 is white); the other holds the labels, numbers between 0 and 9. The code below shows how to load the dataset.

```Python
import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
```
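
As a quick sanity check (a sketch, not part of the tutorial's own snippet), you can confirm that the loaded sets have the layout described above:

```Python
# Sketch: verify the layout described above; each set is an
# (images, labels) pair, with the 50,000/10,000/10,000 split.
train_x, train_y = train_set
print train_x.shape                 # (50000, 784): one row per image
print train_y.shape                 # (50000,): one 0-9 label per image
print train_x.min(), train_x.max()  # pixel values lie in [0, 1]
```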

When using the dataset we usually split it into minibatches. We encourage you to store the dataset in shared variables and to access it by minibatch index (given a fixed, known batch size). The reason behind shared variables is GPU performance: copying data into GPU memory is a large overhead, so without shared variables, GPU code would run no faster than CPU code. If you declare your data as shared variables, you give Theano the opportunity to copy the entire dataset to the GPU in a single call when the shared variables are constructed. Afterwards the GPU can access any minibatch by taking a slice of the shared variables, with no further copying from the CPU. Also, because the data vectors (reals) and the labels (integers) have different types, and the test, validation, and training sets serve different purposes, we suggest storing each in its own shared variable (which gives six shared variables in total).

Since the data now sits in one variable, a minibatch is defined as a slice of that variable; it is more natural to define a minibatch by its index and its size. The code below shows how to store the data and how to access a minibatch.

```Python
import theano
import theano.tensor as T

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every
    time it is needed (the default behaviour if the data is not in a
    shared variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats,
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as indices, and if they are
    # floats it doesn't make sense), therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around the issue.
    return shared_x, T.cast(shared_y, 'int32')
```
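
With `shared_dataset` defined, the three sets are loaded once, and a minibatch is then just a slice of the shared variables; the sketch below assumes a `batch_size` of 500:

```Python
test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch (an assumed value)

# Accessing the third minibatch of the training set: a slice of a
# shared variable, so no extra CPU-to-GPU copy is required.
data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]
```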

The data is stored on the GPU as floats (the `dtype` is given by `theano.config.floatX`), and the labels are then cast back to int.

If you are running your code on the GPU and the dataset is too large to fit in GPU memory, the code will crash. In that case you should still store the data in shared variables, but in sufficiently small chunks (a few minibatches each): keep one chunk in a shared variable and use it during training, and once you are done with it, overwrite the stored values with the next chunk. This minimizes the data transfer between CPU and GPU memory.
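
A minimal sketch of that chunking pattern, assuming a hypothetical `load_chunk(offset, n)` helper that returns `n` examples from disk as a numpy array, and an assumed total count `n_train_examples`:

```Python
# Hypothetical sketch: keep only a small chunk on the GPU and overwrite
# it in place as training advances. `load_chunk` and `n_train_examples`
# are assumed, not part of the tutorial.
chunk_size = 10 * batch_size    # a few minibatches per chunk
chunk_x = theano.shared(numpy.zeros((chunk_size, 784),
                                    dtype=theano.config.floatX))

for offset in xrange(0, n_train_examples, chunk_size):
    # set_value replaces the stored values without rebuilding the graph;
    # this is the only CPU-to-GPU transfer for the whole chunk.
    chunk_x.set_value(load_chunk(offset, chunk_size))
    # ... train on the minibatches inside chunk_x here ...
```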

README.md

Lines changed: 0 additions & 1 deletion
@@ -27,7 +27,6 @@ This is a `Chinese tutorial` which is translated from [DeepLearning 0.1 document
* Miscellaneous


- \frac{d}{dx}\sin x=\cos x
