Title: Backpropagation Revision
Author: Bruno Alves Pinto
Course: Neuronale Netze (Neural Networks)
Institution: Karlsruher Institut für Technologie
Back-propagation and Stochastic Gradient Descent

Ngoc-Quan Pham, Kay Rotmann and Alex Waibel
Karlsruhe Institute of Technology
{ngoc.pham, kay.rotmann, alex.waibel}@kit.edu

1 Introduction - Multi-layer Perceptron

This manuscript clarifies the MLP and Backpropagation, and hopes to complement the materials provided in the lectures. In the following we will consider a simple classification problem. To maintain generalizability, only the concept of classification is presented instead of a specific problem.

Basic notation for the basic MLP with one hidden layer:

• X ∈ R^{N_X}: continuous vector representation of the input with N_X dimensions. It can be the (flattened) image (from 2D to 1D) or, in general, the most basic description of the input.
• H ∈ R^{N_H}: hidden layer with N_H dimensions (or neurons).
• O ∈ R^{N_O}: output layer with N_O dimensions (or neurons), before the softmax.
• P ∈ (0, 1)^{N_O}: probability distribution of the output given the input, after the softmax.

The weight matrices:

• W^H ∈ R^{N_H × N_X}: the weight matrix between the input layer and the hidden layer.
• b^H ∈ R^{N_H}: the bias vector for the hidden layer.
• W^O ∈ R^{N_O × N_H}: the weight matrix between the hidden layer and the output layer.
• b^O ∈ R^{N_O}: the bias vector for the output layer.
• f: a non-linear activation function (sigmoid, tanh or ReLU). These are element-wise functions, unlike the weighted sums in fully connected layers.

The term 'layer' usually includes the activation function. For example, if one says "a hidden layer with ReLU function followed by another hidden layer with sigmoid activation function", we understand that there are two 'layers', not four.
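As a concrete illustration of the notation above, the sketch below instantiates a tiny MLP with assumed sizes N_X = 4, N_H = 3, N_O = 2 (these numbers are illustrative, not prescribed by the text) and checks that every shape matches the definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative sizes (not prescribed by the text).
N_X, N_H, N_O = 4, 3, 2

X = rng.normal(size=N_X)            # input vector, X in R^{N_X}
W_H = rng.normal(size=(N_H, N_X))   # input-to-hidden weights, W^H in R^{N_H x N_X}
b_H = np.zeros(N_H)                 # hidden bias, b^H in R^{N_H}
W_O = rng.normal(size=(N_O, N_H))   # hidden-to-output weights, W^O in R^{N_O x N_H}
b_O = np.zeros(N_O)                 # output bias, b^O in R^{N_O}

# The shapes line up exactly as the definitions require:
assert (W_H @ X + b_H).shape == (N_H,)
assert (W_O @ (W_H @ X + b_H) + b_O).shape == (N_O,)
```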

2 Transformation logic

hidden layer The input is first transformed into a hidden representation. Each neuron in the hidden layer is a weighted sum over all elements of the input layer, with the weights coming from our weight matrix W^H:

H_i = f( Σ_j^{N_X} W^H_{ij} X_j + b^H_i )    (1)

Above is the mapping from each element j of the input to an element i of the hidden layer via the weight W^H_{ij}. The whole transformation for the layer can be done in parallel, in the form of a matrix-vector multiplication (matrix-matrix when a whole batch of inputs is stacked):

H = f(W^H X + b^H)    (2)
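Equations (1) and (2) describe the same computation. The short check below (sizes assumed, sigmoid chosen as f for illustration) computes the hidden layer neuron-by-neuron and as one matrix-vector product and confirms that the two agree.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N_X, N_H = 4, 3                      # assumed illustrative sizes
X = rng.normal(size=N_X)
W_H = rng.normal(size=(N_H, N_X))
b_H = rng.normal(size=N_H)

# Equation (1): one neuron at a time.
H_loop = np.array([sigmoid(sum(W_H[i, j] * X[j] for j in range(N_X)) + b_H[i])
                   for i in range(N_H)])

# Equation (2): the whole layer as one matrix-vector product.
H_mat = sigmoid(W_H @ X + b_H)

assert np.allclose(H_loop, H_mat)
```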

Disclaimer: look at the dimensions — the matrix W^H is N_H × N_X and the vector X has size N_X, so the matrix multiplication is well defined. In some other sources the equation looks a bit different, H = f(X W^H + b^H), which indicates that the size of W^H there is N_X × N_H.

output layer The transformation from H to O is almost identical to the counterpart from X to H, but without an activation function:

O = W^O H + b^O    (3)
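Putting equations (2) and (3) together, a full forward pass is just two lines. The sketch below (shapes assumed as before, tanh as the activation) also verifies the convention from the disclaimer: the alternative form with the transposed weight matrix yields exactly the same hidden layer.

```python
import numpy as np

rng = np.random.default_rng(2)
N_X, N_H, N_O = 4, 3, 2              # assumed illustrative sizes
X = rng.normal(size=N_X)
W_H = rng.normal(size=(N_H, N_X))
b_H = rng.normal(size=N_H)
W_O = rng.normal(size=(N_O, N_H))
b_O = rng.normal(size=N_O)

H = np.tanh(W_H @ X + b_H)           # equation (2), f = tanh
O = W_O @ H + b_O                    # equation (3), no activation

# The alternative convention H = f(X W^H + b^H) with W^H of size N_X x N_H
# is the same computation with the transposed matrix:
assert np.allclose(np.tanh(X @ W_H.T + b_H), H)
assert O.shape == (N_O,)
```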

Each neuron in the output layer is associated with one class in the classification problem. For example, if we want to classify the input into 10 classes, the output layer will have 10 neurons, each of which is a score indicating the network's prediction towards that class. Classification problems normally require the output layer to follow a Multinomial (or Categorical) distribution, which can be interpreted as the network estimating the conditional distribution P(Y|X), in which Y is a particular class. In that case, the Softmax function is applied so that

P_i = exp(O_i) / Σ_j exp(O_j)    (4)

The i-th component of P corresponds to the estimated probability of the i-th class given the input feature X. Note: the weights (W^H, W^O, b^H, b^O) are learnable parameters, but the size of the hidden layer is a hyper-parameter (it cannot be learned; it is chosen manually before setting up the network).

3 Training

Training the network simply means "changing the weights" (W^H, W^O, b^H, b^O) (the learnable parameters) so that the network minimizes the Loss function over the whole training data D.

Loss function Here the loss function is to minimize the difference between the output distribution of the model (P) and the ground-truth distribution. P is continuous and dense (no element is exactly 0, thanks to the softmax), for example P = [0.2, 0.3, 0.5], while for each sample we often only have the label of the class (for example one of three classes). In that case the "ground truth distribution" is represented as a "one-hot vector" (which you have seen in the character-based language model): a sparse vector with all elements being 0 and only the index of the class being 1. For example, if we need to classify three different labels Car, Bicycle and Plane, there are three different ground-truth distributions for this problem:

• if the class is Car, then y = [0, 0, 1]
• if the class is Bicycle, then y = [0, 1, 0]
• if the class is Plane, then y = [1, 0, 0]

The most common loss function to minimize the difference between the model distribution and the ground-truth distribution is the Cross-Entropy loss:

L_CE(P, y) = − Σ_i y_i log(P_i)    (5)

In the case that the ground truth is a one-hot vector, the terms in the sum of L_CE are all 0 except for the one at the position of the label. In this case the loss function is also called the Negative Log-Likelihood loss function. Over the whole training data the loss is:

L = Σ_j^{|D|} L_CE(Net(X_j), Y_j)    (6)

Net(X_j) is the output of the network given data sample X_j (the output distribution P). Unlike the Perceptron, the MLP weights cannot be updated by a simple rule; we have to use Gradient Descent. Gradient Descent is an approach for this kind of machine learning model (which comes from the theory of function approximation and optimization) which tells us that: for each parameter w of the network (no matter where it sits, in W^H or W^O), if we can estimate the gradient of the loss function with respect to (w.r.t.) w, written ∂L/∂w, we can update the weight by moving it in the opposite direction of the gradient:

w ← w − α ∂L/∂w    (7)

in which α is a learning rate deciding the magnitude of each update. The question is how we can get the gradients ∂L/∂w. This is where Backpropagation becomes crucial, because it makes gradient-based learning tractable for neural networks. (Remember the numerical gradients from the exercise: they are another way to get the gradients, but ||W|| times slower than back-propagation.)

4 Back-propagation

Let's look at the first matrix W^H mapping X to H. The relationship between W^H and the output layer P is a composite function that looks like:

P = softmax(W^O tanh(W^H X + b^H) + b^O)    (8)

For deeper networks (more layers), this composite function looks even more intractable, so estimating the gradients, for example ∂L/∂W^H, is not easy. However, there are layers (or variables) which are closer to the output layer and the loss function than the weights, and the gradients w.r.t. them are easy to compute.
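The softmax (4) and the cross-entropy loss (5) can be made concrete in a few lines. The sketch below uses an assumed score vector for the Car/Bicycle/Plane example (the numbers are illustrative) and checks that with a one-hot target the loss collapses to the negative log-likelihood of the labeled class.

```python
import numpy as np

def softmax(O):
    e = np.exp(O - O.max())          # subtracting the max is a standard stability trick
    return e / e.sum()               # equation (4)

def cross_entropy(P, y):
    return -np.sum(y * np.log(P))    # equation (5)

O = np.array([2.0, 1.0, 0.1])        # assumed output scores (illustrative values)
P = softmax(O)
assert np.isclose(P.sum(), 1.0)      # P is a valid probability distribution

y = np.array([0.0, 0.0, 1.0])        # "if the class is Car, then y = [0, 0, 1]"
loss = cross_entropy(P, y)

# With a one-hot target the sum collapses to a single term: -log(P at the label).
assert np.isclose(loss, -np.log(P[2]))
```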

The variable closest to the loss function is P, so its gradient is the simplest:

∂L/∂P_i = − Σ_k y_k ∂log(P_k)/∂P_i
        = − Σ_k y_k (1/P_k) (∂P_k/∂P_i)
        = − y_i / P_i

The last transformation is possible because ∂P_k/∂P_i is 0 everywhere except 1 when i = k. This makes sense: the loss is a sum, so the gradient is distributed equivalently over each element of P.

From here, I will use the term "backpropagate" to mean using the chain rule to make gradient computation tractable. Specifically, for a feed-forward network, when we want to compute the gradients at a particular layer H^l, i.e. ∂L/∂H^l, this term is estimated from ∂L/∂H^{l+1} and the chain rule. For more complicated networks such as recurrent nets, H^l can be connected to several nodes at the same time (e.g. the output layer O at the same time step as H^l, and the hidden layer of the next time step); in that case we need to combine the gradients accordingly. For this specific network, assume that we computed the term ∂L/∂P_i and stored it for the next computational step.

output layer Next, we back-propagate to the layer closest to P, which is O; ∂L/∂O_i is the term we want to compute now. According to the softmax function, each neuron in O is connected to all neurons in P, so:

∂L/∂O_i = Σ_k (∂L/∂P_k) × (∂P_k/∂O_i)

The first term in the sum is known: −y_k/P_k. We only need to compute the second term. The gradient of the softmax function is given as:

∂P_k/∂O_i = P_i (1 − P_i)   if i = k
∂P_k/∂O_i = −P_i P_k        if i ≠ k

So the gradient at the output layer should be:

∂L/∂O_i = Σ_k (−y_k/P_k) × (∂P_k/∂O_i)
        = −y_i (1 − P_i) + Σ_{k≠i} y_k P_i
        = −y_i + y_i P_i + Σ_{k≠i} y_k P_i
        = P_i − y_i

where the last step uses Σ_k y_k = 1 for a one-hot vector. This is in line with the softmax layer's gradient we have seen in the assignment. In practice, since the gradient for ∂L/∂O_i is so simple, we often skip P in the chain rule. For the whole vector, we can compactly write:

∂L/∂O = P − y

At the hidden layer From the inference equation for the output layer, O = W^O H + b^O, we need to backpropagate the gradients to the previous layer. This immediately implies:

∂L/∂b^O = ∂L/∂O = P − y

(as a whole vector, because the sum of W^O H and b^O is an element-wise operation, so the other term cancels out in the derivative). For the weight matrix W^O, let's look at the element-wise operation:

O_i = Σ_j^{N_H} W^O_{ij} H_j + b^O_i   →   ∂L/∂W^O_{ij} = (∂L/∂O_i) × H_j

So the gradient of the weight matrix ∂L/∂W^O is a big matrix like this:

[ (∂L/∂O_0) H_0      (∂L/∂O_0) H_1      ...  (∂L/∂O_0) H_{N_H}     ]
[ (∂L/∂O_1) H_0      (∂L/∂O_1) H_1      ...  (∂L/∂O_1) H_{N_H}     ]
[ ...                ...                ...  ...                   ]
[ (∂L/∂O_{N_O}) H_0  (∂L/∂O_{N_O}) H_1  ...  (∂L/∂O_{N_O}) H_{N_H} ]

This big matrix can be written compactly as an outer product:

∂L/∂W^O = (∂L/∂O) H^T = (P − y) H^T

Long story short, we found the actual equation for ∂L/∂W^O by using the chain rule, turning the difficult gradient computation into a manageable one. But we still need the gradients ∂L/∂W^H, so we have to extend the chain further. Starting again from

O_i = Σ_j^{N_H} W^O_{ij} H_j + b^O_i

→ ∂L/∂H_j = Σ_i (∂L/∂O_i) × (∂O_i/∂H_j) = Σ_i W^O_{ij} (∂L/∂O_i)

which basically means that since H_j is connected to all units in the output layer O, the gradient is the sum of the composite gradients above. Reordering the final equation lets us write it in compact matrix form:

∂L/∂H = W^{O,T} (∂L/∂O) = W^{O,T} (P − y)

At the input layer Applying the same technique, you can quickly find out that we can compute the gradients for the weight matrix W^H. With

H = f(W^H X + b^H) = f(Z)

→ ∂L/∂Z = ∂L/∂H × ∂H/∂Z

The term ∂H/∂Z depends on the activation function we use (it is element-wise). After that, we can compute the gradients w.r.t. the biases and the weights of the first fully connected layer, in the same manner as for the upper layer:

∂L/∂b^H = ∂L/∂Z,    ∂L/∂W^H = (∂L/∂Z) X^T

That is the end of the back-propagation process for this network. The key ideas are:

• The chain rule establishes a connection from a variable, for example W^H, to the loss function, so that the gradients can be computed analytically.

• During the "backward" pass we do not use the intermediate derivatives, for example ∂L/∂Z, to update the parameters themselves; we use the intermediate derivatives to make the computation tractable.

• The process of back-propagation scales linearly with the depth of the network. The only requirement is that each function in the chain is differentiable.
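The whole derivation above fits in a dozen lines of code. The sketch below (sigmoid as f, sizes and data assumed for illustration) implements the forward pass and the backward pass exactly as derived, then verifies ∂L/∂W^H against a slow numerical gradient of the kind mentioned earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(O):
    e = np.exp(O - O.max())
    return e / e.sum()

rng = np.random.default_rng(3)
N_X, N_H, N_O = 4, 5, 3              # assumed illustrative sizes
X = rng.normal(size=N_X)
y = np.eye(N_O)[0]                   # assumed one-hot ground truth (class 0)
W_H, b_H = rng.normal(size=(N_H, N_X)), np.zeros(N_H)
W_O, b_O = rng.normal(size=(N_O, N_H)), np.zeros(N_O)

def loss(W_H_):                      # cross-entropy loss as a function of W^H
    H = sigmoid(W_H_ @ X + b_H)
    P = softmax(W_O @ H + b_O)
    return -np.sum(y * np.log(P))

# Forward pass.
Z = W_H @ X + b_H
H = sigmoid(Z)
O = W_O @ H + b_O
P = softmax(O)

# Backward pass: exactly the formulas derived above.
dO = P - y                           # dL/dO = P - y
dW_O = np.outer(dO, H)               # dL/dW^O = (P - y) H^T
db_O = dO                            # dL/db^O = P - y
dH = W_O.T @ dO                      # dL/dH = W^{O,T} (P - y)
dZ = dH * H * (1.0 - H)              # sigmoid: dH/dZ = H (1 - H), element-wise
dW_H = np.outer(dZ, X)               # dL/dW^H = (dL/dZ) X^T
db_H = dZ                            # dL/db^H = dL/dZ

# Numerical gradient check on one entry of W^H (central differences).
eps = 1e-6
W_plus = W_H.copy();  W_plus[0, 0] += eps
W_minus = W_H.copy(); W_minus[0, 0] -= eps
num = (loss(W_plus) - loss(W_minus)) / (2 * eps)
assert np.isclose(num, dW_H[0, 0], atol=1e-6)
```

Note that a single numerical check already requires two extra forward passes per parameter, which is why back-propagation is ||W|| times faster.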

5 Some common inputs / outputs

In the first part, X is required to be in R^{N_X}: basically a continuous (real) vector with N_X dimensions. In some cases one more step is required to achieve this:

• 2-dimensional inputs like images / speech. It is possible to "flatten" the matrix, i.e. rearrange the elements of the matrix into a vector.

• Discrete symbols such as characters/words (in natural language processing). It is common to represent them as one-hot vectors (similar to the way we represent the labels). This approach is also called continuous space representation, because we turn a discrete symbol into a continuous vector, whose values are updated by the gradients of the neural network.
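Both preprocessing steps take only a couple of lines each. The snippet below flattens an assumed tiny 2×3 "image" and one-hot encodes a word index from an assumed 5-word vocabulary (both sizes are illustrative).

```python
import numpy as np

image = np.arange(6).reshape(2, 3)        # assumed tiny 2x3 "image"
X_flat = image.flatten()                  # rearrange the matrix into a vector
assert X_flat.shape == (6,)

vocab_size, word_index = 5, 2             # assumed vocabulary size and symbol index
one_hot = np.eye(vocab_size)[word_index]  # all zeros except a 1 at the symbol's index
assert one_hot.tolist() == [0.0, 0.0, 1.0, 0.0, 0.0]
```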

