-.. _gettingstarted:
+.. _gettingstarted:
 
 
 ===============
@@ -55,9 +55,7 @@ MNIST Dataset
 images. An image is represented as a numpy 1-dimensional array of 784 (28
 x 28) float values between 0 and 1 (0 stands for black, 1 for white).
 The labels are numbers between 0 and 9 indicating which digit the image
-represents. When using the dataset, we usually divide it in minibatches
-(see :ref:`opt_SGD`). The code block below shows how to load the
-dataset and how to divide it in minibatches of a given size :
+represents. The code block below shows how to load the dataset.
 
 
 .. code-block:: python
@@ -69,43 +67,59 @@ MNIST Dataset
     train_set, valid_set, test_set = cPickle.load(f)
     f.close()
 
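+As a quick illustration (not part of the tutorial code), one can
+inspect a single example to confirm this layout; the variable names
+below are arbitrary:
+
+.. code-block:: python
+
+    img = train_set[0][0]        # first training image
+    lbl = train_set[1][0]        # its label
+    print img.shape              # (784,)
+    print img.reshape(28, 28)    # the original 2-D 28 x 28 layout
+    print lbl                    # the digit this image represents
+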
-    # make minibatches of size 20
-    batch_size = 20    # sized of the minibatch
-
-    # Dealing with the training set
-    # get the list of training images (x) and their labels (y)
-    (train_set_x, train_set_y) = train_set
-    # initialize the list of training minibatches with empty list
-    train_batches = []
-    for i in xrange(0, len(train_set_x), batch_size):
-        # add to the list of minibatches the minibatch starting at
-        # position i, ending at position i+batch_size
-        # a minibatch is a pair ; the first element of the pair is a list
-        # of datapoints, the second element is the list of corresponding
-        # labels
-        train_batches = train_batches + \
-            [(train_set_x[i:i+batch_size], train_set_y[i:i+batch_size])]
-
-    # Dealing with the validation set
-    (valid_set_x, valid_set_y) = valid_set
-    # initialize the list of validation minibatches
-    valid_batches = []
-    for i in xrange(0, len(valid_set_x), batch_size):
-        valid_batches = valid_batches + \
-            [(valid_set_x[i:i+batch_size], valid_set_y[i:i+batch_size])]
-
-    # Dealing with the testing set
-    (test_set_x, test_set_y) = test_set
-    # initialize the list of testing minibatches
-    test_batches = []
-    for i in xrange(0, len(test_set_x), batch_size):
-        test_batches = test_batches + \
-            [(test_set_x[i:i+batch_size], test_set_y[i:i+batch_size])]
-
-
-    # accessing training example i of minibatch j
-    image = training_set[j][0][i]
-    label = training_set[j][1][i]
+
+When using the dataset, we usually divide it into minibatches (see
+:ref:`opt_SGD`). We encourage you to store the dataset in shared
+variables and to access it based on the minibatch offset, given a
+fixed and known batch size. The reason for using shared variables is
+related to the GPU. There is a large overhead when copying data into
+GPU memory. If you copied the data on request (each minibatch
+individually, when needed), as the code would do without shared
+variables, the GPU code would not be much faster than the CPU code
+(it might even be slower) because of this overhead. If your data is
+in Theano shared variables, however, you give Theano the possibility
+to copy the entire data onto the GPU in a single call when the shared
+variables are constructed. Afterwards the GPU can access any
+minibatch by taking a slice of these shared variables, without
+copying any information from CPU memory, thereby bypassing the
+overhead.
+Because the datapoints and their labels are usually of a different
+nature (labels are usually integers while datapoints are real
+numbers), we suggest using different variables for labels and data.
+We also recommend using different variables for the training,
+validation and testing sets, to make the code more readable
+(resulting in 6 different shared variables).
+
+Since the data is now in one variable, and a minibatch is defined as
+a slice of that variable, it becomes natural to define a minibatch by
+indicating where the slice starts (the offset) and how large it is
+(the batch size). Note that since the batch size stays constant
+throughout the execution of the code, a function will require only
+the offset as input in order to identify on which minibatch to work.
+The code below shows how to store your data and how to access a
+minibatch:
+
+
+.. code-block:: python
+
+    import numpy
+    import theano
+    import theano.tensor as T
+
+    def shared_dataset(data_xy):
+        """Load the dataset into shared variables."""
+        data_x, data_y = data_xy
+        shared_x = theano.shared(numpy.asarray(data_x,
+                                               dtype=theano.config.floatX))
+        shared_y = theano.shared(numpy.asarray(data_y,
+                                               dtype=theano.config.floatX))
+        # When storing data on the GPU it has to be stored as floats,
+        # so the labels are stored as ``floatX`` as well. During
+        # computation, however, the labels are needed as integers (to
+        # be usable as indices), hence the cast below.
+        return shared_x, T.cast(shared_y, 'int32')
+
+    test_set_x, test_set_y = shared_dataset(test_set)
+    valid_set_x, valid_set_y = shared_dataset(valid_set)
+    train_set_x, train_set_y = shared_dataset(train_set)
+
+    batch_size = 500    # size of the minibatch
+
+    # accessing the third minibatch of the training set
+    data = train_set_x[2 * batch_size: 3 * batch_size]
+    label = train_set_y[2 * batch_size: 3 * batch_size]
+
+
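+For illustration only, here is a minimal sketch (not part of the
+tutorial code) of how a function can be parameterized by the
+minibatch offset alone: it uses ``givens`` to substitute slices of
+the shared variables for the function's symbolic inputs. The names
+``index`` and ``eval_minibatch``, as well as the stand-in ``cost``
+expression, are illustrative assumptions.
+
+.. code-block:: python
+
+    index = T.lscalar('index')  # offset of the minibatch
+
+    x = T.matrix('x')   # symbolic variable for a minibatch of images
+    y = T.ivector('y')  # symbolic variable for the matching labels
+
+    # a stand-in expression; a real model would use its own cost
+    cost = T.mean(x) + 0. * T.sum(y)
+
+    eval_minibatch = theano.function(
+        inputs=[index],
+        outputs=cost,
+        # ``givens`` replaces the symbolic inputs by slices of the
+        # shared variables, so no data is copied from CPU memory when
+        # the function is called
+        givens={
+            x: train_set_x[index * batch_size: (index + 1) * batch_size],
+            y: train_set_y[index * batch_size: (index + 1) * batch_size]})
+
+    # evaluate the stand-in cost on the third minibatch
+    print eval_minibatch(2)
+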
 
 
 .. index:: Notation
@@ -503,7 +517,8 @@ of a strategy based on a geometrically increasing amount of patience.
                                    # validation error is found
     improvement_threshold = 0.995  # a relative improvement of this much is
                                    # considered significant
-    validation_frequency = 2500    # make this many SGD updates between validations
+    validation_frequency = min(2500, patience / 2.)
+                                   # make this many SGD updates between validations
 
     # initialize cross-validation variables
     best_params = None
@@ -547,6 +562,14 @@ of a strategy based on a geometrically increasing amount of patience.
 If we run out of batches of training data before running out of patience, then
 we just go back to the beginning of the training set and repeat.
 
+
+.. note::
+
+    The ``validation_frequency`` should always be smaller than
+    ``patience``, so that the code checks the validation performance
+    at least twice before running out of patience. This is the reason
+    we use the formulation
+    ``validation_frequency = min(2500, patience / 2.)``.
+
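+As an illustration of how the pieces above fit together, here is a
+minimal sketch of the patience-based loop (not the tutorial's actual
+training code). The stubs ``train_on_batch`` and ``validation_error``
+are hypothetical placeholders for one SGD update and for evaluating
+the model on the validation set.
+
+.. code-block:: python
+
+    n_train_batches = 100            # minibatches in the training set
+
+    def train_on_batch(i):           # stub: one SGD update on minibatch i
+        pass
+
+    def validation_error():          # stub: current validation error
+        return numpy.random.rand()
+
+    patience = 5000
+    patience_increase = 2
+    improvement_threshold = 0.995
+    validation_frequency = min(2500, patience / 2.)
+
+    best_validation_error = numpy.inf
+    iter = 0
+    done_looping = False
+    while not done_looping:
+        # if we run out of minibatches, wrap around to the beginning
+        for minibatch_index in xrange(n_train_batches):
+            train_on_batch(minibatch_index)
+            iter += 1
+            if iter % validation_frequency == 0:
+                this_error = validation_error()
+                if this_error < best_validation_error * improvement_threshold:
+                    # significant improvement: be more patient
+                    patience = max(patience, iter * patience_increase)
+                best_validation_error = min(best_validation_error, this_error)
+            if iter >= patience:
+                done_looping = True
+                break
+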
 .. note::
 
     This algorithm could possibly be improved by using a test of statistical significance