Welcome friends,
It's time for neural networks!
- It should work the same way as the previous HW: download, compile, test.
- This time we are testing by training on various image recognition datasets (MNIST/CIFAR)
- The repo is tested to work with Linux and MacOS.
First, clone the repository!
git clone https://github.com/holynski/cse576_sp20_hw3.git
Download the necessary datasets. Once you run the command below, they should be present in the directories mnist and cifar.
./download_datasets.sh
You can now compile your code with cmake.
mkdir build
cd build
cmake ..
make
Or you can use the provided convenience script:
./compile.sh
This will build the code and create an executable in the project directory called train. You can run it as:
./train
We've added a bunch of new data structures and types to use for building up neural networks and other machine learning algorithms. Please read through these source files before starting the assignment--it may save you a lot of time.
The core of this assignment will be the Matrix class. You should read through src/matrix.h and get a feel for what is there.
Here are some pointers to get you started:
- The basic arithmetic operators are overloaded, so if you have two matrices A and B you can call A-B, A+B, A/B.
- The * operator represents matrix multiplication, so A*B will matrix-multiply A and B.
- If you want to do elementwise multiplication you can call Matrix::elementwise_multiply(A, B).
- We also provide other functions such as transpose(), so please make use of them.
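For example, assuming A and B are Matrix objects of compatible shapes (the exact constructors and signatures are in src/matrix.h), typical usage might look like:

Matrix S = A + B;                              // elementwise sum
Matrix P = A * B;                              // matrix multiplication
Matrix E = Matrix::elementwise_multiply(A, B); // elementwise product
Matrix At = A.transpose();                     // transpose of A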
This file contains definitions for constructing a neural network.
- Layer: this class contains the definition as well as the weights and gradient updates of a neural network layer.
- Activation: this is an enum which defines the different types of activation functions.
We will be training a fully-connected neural network for this assignment.
- src/activations.cpp: you will implement the forward and backward passes for several activation functions.
- src/classifier.cpp: you will implement gradient computation and parameter updates using algorithms we discussed in class.
You'll be training on two datasets: MNIST, a digit-recognition dataset, and CIFAR, a simple visual recognition dataset.
An important part of machine learning, be it linear classifiers or neural networks, is the activation function you use.
We will be implementing the following activation functions:
- Linear: f(x) = x
- Logistic: f(x) = 1 / (1 + exp(-x))
- tanh: f(x) = tanh(x)
- ReLU: f(x) = max(0, x)
- Leaky ReLU: f(x) = 0.01*x if x < 0 else x
- Softmax: https://en.wikipedia.org/wiki/Softmax_function
Fill in the Matrix forward_*(const Matrix& matrix) functions in src/activations.cpp.
- The input is a non-activated layer output matrix, and the output f(matrix) is the activation applied elementwise (so you may need to write for loops over all elements).
- Remember that for our softmax activation we take e^x for every element x, but then we have to normalize each element by the sum as well. Each row in the matrix is a separate data point, so we want to normalize over each data point (row) separately. A minimal sketch follows this list.
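As an illustration, here is a sketch of the softmax forward pass. It assumes the Matrix class exposes its dimensions as rows/cols and elementwise access via operator()(int, int); check src/matrix.h for the actual interface.

#include <cmath> // for exp

Matrix forward_softmax(const Matrix& matrix)
{
    Matrix out = matrix; // same shape as the input
    for (int i = 0; i < out.rows; i++) {
        double sum = 0.0;
        // take e^x for every element in this row
        for (int j = 0; j < out.cols; j++) {
            out(i, j) = exp(matrix(i, j));
            sum += out(i, j);
        }
        // normalize each data point (row) separately
        for (int j = 0; j < out.cols; j++)
            out(i, j) /= sum;
    }
    return out;
}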
We will now compute the backwards pass for each activation. Each function signature looks like
Matrix backward_*(const Matrix& out, const Matrix& prev_grad)
where out is the activated layer output and prev_grad is the gradient of the next layer (towards the loss function).
To compute the backward pass:
- First compute the gradient of the activation function f'(x), i.e. the partial derivative of the activated layer output with respect to the non-activated layer output.
- Then multiply the gradient contribution of the activation by the gradient from the next layer and return it.
Note that we compute the gradient only based on the activated output:
- Normally, to calculate f'(x) we would need to remember what the input to the function, x, was.
- However, all of our activation functions have the nice property that we can calculate the gradient f'(x) using only the output f(x). This means we have to remember less stuff when running our model.
To get you started, we'll provide you with the derivative of the logistic function. You can read more about it here.
# Derivative of the logistic (sigmoid) function.
f'(x) = f(x) * (1 - f(x))
Compute the symbolic derivatives of each of the activation functions (feel free to look to the slides if necessary).
- For Leaky ReLU, use a slope of 0.01.
- As discussed in the backpropagation slides, we just multiply our partial derivative by f'(x) to backpropagate through an activation function. A sketch for the logistic case follows below.
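As a concrete example, here is a sketch of the logistic backward pass built from the derivative above. It makes the same assumptions about the Matrix interface (rows/cols members and operator()(int, int) access); adapt it to what src/matrix.h actually provides.

Matrix backward_logistic(const Matrix& out, const Matrix& prev_grad)
{
    Matrix grad = out; // same shape as the activated output
    for (int i = 0; i < grad.rows; i++)
        for (int j = 0; j < grad.cols; j++)
            // f'(x) = f(x) * (1 - f(x)), and out already holds f(x)
            grad(i, j) = out(i, j) * (1.0 - out(i, j)) * prev_grad(i, j);
    return grad;
}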
Notes on Softmax:
- The gradient computation for Softmax is a bit trickier than the other ones since it's a vector-to-vector function.
- We've provided some skeleton code to get you started. You must first compute the Jacobian of the Softmax function in softmax_jacobian (its closed form is given below).
- Use the Jacobian in backward_softmax to then compute the actual backward pass.
- Here are some excellent notes: https://mattpetersen.github.io/softmax-with-cross-entropy
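For reference, the Jacobian of the softmax for a single data point (row) p = softmax(x) has a well-known closed form:

# Jacobian of the softmax, J_ij = dp_i/dx_j:
J_ij = p_i * (delta_ij - p_j)    where delta_ij = 1 if i == j, else 0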
Fill in the Matrix backward_*(const Matrix& out, const Matrix& prev_grad) functions in src/activations.cpp to multiply the elements of prev_grad by the correct gradient where out is the output of a layer.
Now we can fill in how to forward-propagate information through a layer. First check out our layer struct:
struct Layer
{
// Runtime Data terms
Matrix in; // Input to a layer (aka x)
Matrix out1; // Output before activation (aka xw)
Matrix out2; // Output after activation (actual output, aka y)
// Backpass saved terms
Matrix grad_out1;
Matrix grad_in;
// Weight and weight management
Matrix w; // Current weights for a layer
Matrix grad_w; // Current weight updates
Matrix v; // Past weight updates (for use with momentum)
// Type
Activation activation; // Activation the layer uses
......
}
During forward propagation we will do a few things.
- We'll multiply the input matrix by our weight matrix w.
- We'll apply the activation function activation.
- We'll also save the input and output matrices in the layer so that we can use them during backpropagation. But this is already done for you.
Fill in the TODOs in src/classifier.cpp:
Matrix forward_weights(const Layer &l, const Matrix &in)
- Hint: Each row in the input in is a separate data point, and each row in the returned matrix is the result of passing that data point through the layer.
- Using matrix operations we can batch-process data.
- Think about the shapes of the matrices and what the multiplication order should be (a one-line sketch appears after this section).
Matrix forward_activation(const Layer &l, const Matrix &out1)
- Hint: make use of the function Matrix forward_activate_matrix(const Matrix &matrix, Activation a).
See Matrix Layer::forward(const Matrix& in) to see how they are connected.
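Since the * operator already performs matrix multiplication, forward_weights can be a one-liner, as in this sketch (assuming each row of in is one data point, matching the input dimension of l.w):

Matrix forward_weights(const Layer &l, const Matrix &in)
{
    // (batch x inputs) * (inputs x outputs) -> (batch x outputs)
    return in * l.w;
}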
We have to backpropagate error through our layer.
- We will be given dL/dy, the derivative of the loss with respect to the output of the layer, y.
- The output y is given by y = f(xw), where x is the input, w is our weights for that layer, and f is the activation function.
- What we want to calculate is dL/dw, so we can update our weights, and dL/dx, which is the backpropagated loss to the previous layer.
Look at Matrix Layer::backward(const Matrix& grad_y) to see how they are connected.
First we need to calculate dL/d(xw) using the gradient of our activation function. Recall the chain rule:
dL/d(xw) = dL/dy * f'(xw)
Use the Matrix backward_activate_matrix(const Matrix& out, const Matrix& prev_grad, Activation a) function to compute the derivative dL/d(xw) from dL/dy.
Fill in Matrix backward_xw(const Layer &l, const Matrix &grad_y) in src/classifier.cpp.
Next we want to calculate the derivative with respect to our weights, dL/dw. Recall:
dL/dw = dL/d(xw) * d(xw)/dw
But remember from the slides, to make the matrix dimensions work out right we actually do the matrix operation
dL/dw = x^T * dL/d(xw)
where x^T is the transpose of the input matrix x.
In our layer we saved the input as l.in.
- Calculate x^T using that and the matrix transpose function in our library: mat.transpose().
- Then calculate dL/dw and save it into l.grad_w. We'll use this later when updating our weights.
Fill in the TODOs in Matrix backward_w(const Layer &l) in src/classifier.cpp.
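Putting the two steps together, a sketch of backward_w might look like the following. It assumes l.grad_out1 holds dL/d(xw) saved during the backward pass; check the Layer struct and Layer::backward to confirm.

Matrix backward_w(const Layer &l)
{
    // dL/dw = x^T * dL/d(xw)
    return l.in.transpose() * l.grad_out1;
}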
Next we want to calculate the derivative with respect to the input as well, dL/dx. Recall:
dL/dx = dL/d(xw) * d(xw)/dx
Again, we have to make the matrix dimensions line up, so it actually ends up being
dL/dx = dL/d(xw) * w^T
where w^T is the transpose of our weights, w.
Calculate w^T and then calculate dL/dx. This is the matrix we will return.
Fill in the TODOs in Matrix backward_x(const Layer &l) in src/classifier.cpp.
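Analogously, a sketch of backward_x under the same assumption about l.grad_out1:

Matrix backward_x(const Layer &l)
{
    // dL/dx = dL/d(xw) * w^T
    return l.grad_out1 * l.w.transpose();
}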
After forward and backward propagation, we've computed the gradients of the loss w.r.t. the weights. We can use these gradients to perform gradient updates to the weights.
With learning rate (η), momentum (m), and weight decay (λ), our weight update rule is:
Δw_t = dL/dw_t - λ*w_t + m*Δw_{t-1}
w_{t+1} = w_t + η*Δw_t
Some notes:
- We'll be doing gradient ascent (the partial derivative component is positive) instead of descent because we'll flip the sign of our partial derivative when we calculate it at the output.
- We saved dL/dw_t as l.grad_w.
- By convention we'll use l.v to store the previous weight change Δw_{t-1}.
Now let's implement the update rule:
- Calculate the current weight change as a weighted sum of l.grad_w, l.w, and l.v. Although these are vector quantities, our Matrix library treats them as a 1-dimensional matrix, and you can use standard matrix-scalar products.
- Save the current weight change in l.v for the next round.
- Finally, apply the weight change to your weights by adding a scaled amount of the change based on the learning rate, as in the sketch below.
Fill in void update_layer(Layer& l, double rate, double momentum, double decay) using the update rule.
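Here is a sketch of the update under the rule above, assuming Matrix supports + and - between matrices and * with a scalar (verify the available operators in src/matrix.h):

void update_layer(Layer& l, double rate, double momentum, double decay)
{
    // Δw_t = dL/dw_t - λ*w_t + m*Δw_{t-1}
    Matrix dw = l.grad_w - l.w * decay + l.v * momentum;
    l.v = dw;              // remember Δw_t for the next round
    l.w = l.w + dw * rate; // gradient *ascent* step: w_{t+1} = w_t + η*Δw_t
}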
Check out the remainder of the functions which string layers together to make a model, run the model forward and backward, update it, train it, etc. Notable highlights:
Makes a new layer that takes input number of inputs and produces output number of outputs. Note that we randomly initialize the weight matrix, in this case with a uniform random distribution between [-sqrt(2/input), sqrt(2/input)]. The weights could be initialized with other random distributions, but this one works pretty well. This is one of those secret tricks you just have to know from experience!
Will be useful to test out our model once we are done training it.
Calculates the cross-entropy loss for the ground-truth labels y and our predictions p. For classification of an image into one of, say, N object classes, y is an N-length vector with all zero values except for the object class present in the image, for which it is 1. Cross-entropy loss is the negative log-likelihood for multi-class classification (the log-likelihood formulation you may know from logistic regression). Since we added the negative sign, it becomes a loss function that we want to minimize. Cross-entropy loss is given by:
L = Σ(-y_i log(p_i))
for a single data point, summing across the different classes. If the output of our final layer is a softmax, p = σ(wx), then the gradient of the loss with respect to the predictions is:
dL/dp_i = -y_i / p_i
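Combining this with the softmax Jacobian gives the familiar simplification, which is a handy sanity check for your backward_softmax:

# For p = softmax(wx) and cross-entropy loss with one-hot y:
dL/d(wx)_i = p_i - y_i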
Our training code implements SGD. First we get a random subset of the data using Data Data::random_batch(int batch_size). Then we run our model forward and calculate the error we make, dL/dy. Finally, we backpropagate the error through the model and update the weights at every layer.
We'll be testing out your implementation on the MNIST dataset!
Run download_datasets.sh to download the MNIST and CIFAR datasets. If you're not running on Linux, read the script to see where the datasets need to go and download them manually.
Check out train.cpp to see how we're actually going to run the machine learning code you wrote.
There are a couple example models in here for softmax regression and neural networks.
Run the program train to train a softmax regression model on MNIST. You will see a bunch of numbers scrolling by; this is the loss calculated by the model for the current batch. Hopefully over time this loss goes down as the model improves.
After training, the program will calculate the accuracy the model gets on both the training dataset and a separate testing dataset. You should get over 90% accuracy.
Answer questions 2.2.1-2.2.3 in questions.txt.
Now change the training code to use the neural network model instead of the softmax model.
Answer questions 2.3.1-2.3.6 in questions.txt.
While the derivative of L2 loss is straightforward, the gradient of L1 loss is constant, and this will affect training (either the accuracy will be low, or the model will converge to a large loss within a few iterations).
To avoid this, compute the Huber loss instead of L1 and write the Huber loss equation in l1_loss(). In simple words, Huber loss is equal to L1 loss when |y - p| > 1 and equal to L2 loss when |y - p| <= 1. You may have to be careful about the sign (i.e., y-p or p-y); see the equations below.
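In equation form, with e = y - p and the threshold at 1, one common form of the Huber loss and its derivative is:

# Huber loss per element, where e = y - p:
L = 0.5 * e^2       if |e| <= 1
L = |e| - 0.5       if |e| > 1
# derivative with respect to e:
dL/de = e           if |e| <= 1
dL/de = sign(e)     if |e| > 1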
The CIFAR-10 dataset is meant to be similar to MNIST but much more challenging. Using the same model we just created, let's try training on CIFAR.
If you haven't already, run download_datasets.sh to download the CIFAR dataset.
Modify train.cpp to use CIFAR. This should mostly involve swapping out references to mnist with cifar during dataset loading. Then try training the network. You may have to fiddle with the learning rate to get it to train well.
Answer questions 3.2.1 in questions.txt.
Turn in your activations.cpp, classifier.cpp, and questions.txt on canvas under Assignment 3.
- Make sure to test early and often. Make good use of prints.
- If you get NaNs, suspect you have a divide by zero somewhere.
- gdb is a great tool for debugging especially when your code is crashing.
- If you find that debugging is too slow, it's probably because of the data loading code. You can modify the data loading function in src/train.cpp to load the test set as the training set (since the test set is much smaller). Just remember to switch it back after debugging!
- We provide you with some tools in the Matrix class for easy debugging. You can use Matrix::print(int maxrows, int maxcols) to print out matrices for debugging.
- We also provide a handy debug_print function in classifier.cpp which will only print if you call set_verbose(true) first.
forward_linear 1
forward_logistic 2
forward_tanh 2
forward_relu 1
forward_lrelu 1
forward_softmax 3
backward_linear 1
backward_logistic 2
backward_tanh 3
backward_relu 2
backward_lrelu 2
backward_softmax 3
softmax_jacobian 3
forward_weights 4
forward_activation 4
backward_xw 5
backward_w 6
backward_x 6
update_layer 9
Q2.2.1 4
Q2.2.2 4
Q2.2.3 4
Q2.3.1 4
Q2.3.2 4
Q2.3.3 4
Q2.3.4 4
Q2.3.5 4
Q2.3.6 4
Q3.2.1 4
Total 100