Deep Learning SS2021 Exam Solution
Course: Introduction to Deep Learning
Institution: Technische Universität München


Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich

Esolution
Place student sticker here

Exam: IN2346 / Endterm
Introduction to Deep Learning
Examiner: Prof. Dr. Matthias Nießner
Date: Tuesday 13th July, 2021
Time: 17:30 – 19:00

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Working instructions:
• This exam consists of 16 pages with a total of 5 problems. Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 91 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources: none.
• Do not write with red or green colors.

– Page 1 / 16 –

Problem 1: Multiple Choice (18 credits)

Below you can see how you can answer multiple choice questions:
• Mark correct answers with a cross (×).
• To undo a cross, completely fill out the answer option.
• To re-mark an option, use a human-readable marking.

• For all multiple choice questions, any number of answers (zero, one, or multiple) can be correct.

• For each question, you receive 2 points if all boxes are answered correctly (i.e. correct answers are checked, wrong answers are not checked) and 0 otherwise.

1.1 Which of the following models are unsupervised learning methods?
× Auto-Encoder
  Maximum Likelihood Estimate
× K-means Clustering
  Linear regression

1.2 In which cases would you usually reduce the learning rate when training a neural network?
× When the training loss stops decreasing
  To reduce memory consumption
  After increasing the mini-batch size
× After reducing the mini-batch size

1.3 Which techniques will typically decrease your training loss?
  Add additional training data
× Remove data augmentation
× Add batch normalization
  Add dropout

1.4 Which techniques will typically decrease your validation loss?
× Add dropout
× Add additional training data
  Remove data augmentation
  Use ReLU activations instead of LeakyReLU


1.5 Which of the following are affected by multiplying the loss function by a constant positive value when using SGD?
  Memory consumption during training
× Magnitude of the gradient step
  Location of minima
  Number of mini-batches per epoch

1.6 Which of the following functions are not suitable as activation functions to add non-linearity to a network?
  sin(x)
× ReLU(x) − ReLU(−x)
  log(ReLU(x) + 1)
× log(ReLU(x + 1))

1.7 Which of the following introduce non-linearity in the neural network?
× LeakyReLU with α = 0
  Convolution
× Batch Norm
  Skip connection

1.8 Compared to the L1 loss, the L2 loss...
  is robust to outliers
  is costly to compute
× has a different optimum
  will lead to sparser solutions

1.9 Which of the following datasets are NOT i.i.d. (independent and identically distributed)?
  A sequence (toss number, result) of 10,000 coin flips using biased coins with p(toss result = 1) = 0.7
× A set of (image, label) pairs where each image is a frame in a video and each label indicates whether that frame contains humans.
× A monthly sample of Munich's population over the past 100 years
  A set of (image, number) pairs where each image is a chest X-ray of a different human and each number represents the volume of their lungs.


Problem 2: Short Questions (29 credits)

2.1 Explain the idea of data augmentation (1p). Specify 4 different data augmentation techniques you can apply on a dataset of RGB images (2p).

Solution: Improve generalization by adding more data and preventing overfitting (1p). ["make training set larger" without mentioning generalization/overfitting: −0.5p]
Techniques: rotation, cropping, color jittering, salt-and-pepper noise, flipping, translation jitter, ... (0.5p each)

2.2 You are training a deep neural network for the task of binary classification using the Binary Cross Entropy loss. What is the expected loss value for the first mini-batch with batch size N = 64 for an untrained, randomly initialized network?
Hint: BCE = −(1/N) Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]

Solution: −log(0.5) = log(2) ≈ 0.69. (−0.5p for 1/64 or wrong sign)

2.3 Explain the differences between ReLU, LeakyReLU, and Parametric ReLU.

Solution:
• ReLU: constant 0 for negative values (0.5p)
• LeakyReLU: pre-defined slope for negative values (0.5p)
• Parametric ReLU: learnable value for the slope, either one for all channels or one per channel (1p). ("learnable parameter" without specifying what is learned: 0.5p)
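The expected value can be reproduced numerically; a minimal sketch, assuming the untrained network outputs probabilities of roughly 0.5 for every sample:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
y = rng.integers(0, 2, size=N)   # random binary labels for one mini-batch
y_hat = np.full(N, 0.5)          # untrained net: outputs ~0.5 on average

# BCE = -(1/N) * sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(bce)   # log(2) ≈ 0.6931, independent of the labels
```

Since ŷ_i = 0.5 makes both log terms equal to log(0.5), the labels cancel out and the batch size plays no role.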

2.4 How will weights be initialized by Xavier initialization? Which mean and variance will the weights have? Which mean and variance will the output data have?

Solution: With Xavier initialization we initialize the weights to be Gaussian with zero mean and variance Var(w) = 1/n, where n is the number of input neurons. As a result, the output will have zero mean and a similar variance as the input.
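The claimed variance preservation is easy to verify empirically; a sketch with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 1000, 500, 2000

# Xavier: zero-mean Gaussian weights with Var(w) = 1/n_in
W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

x = rng.normal(0.0, 1.0, size=(batch, n_in))  # zero-mean, unit-variance input
out = x @ W

print(float(out.mean()))  # ≈ 0: zero mean
print(float(out.var()))   # ≈ 1: variance similar to the input's
```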


2.5 Why do we often refer to L2-regularization as "weight decay"? Derive a mathematical expression that includes the weights W, the learning rate η, and the L2 regularization hyperparameter λ to explain your point.

Solution:
Reg = 0.5 · λ · ||W||²
Upon a gradient update: W_new = W − η · ∇Reg = W − η · λ · W = (1 − η · λ) · W
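A minimal numeric check of this identity (the values of W, η, and λ are chosen arbitrarily):

```python
import numpy as np

W = np.array([2.0, -1.0, 0.5])
eta, lam = 0.1, 0.01

grad_reg = lam * W                    # ∇(0.5 * λ * ||W||²) = λ * W
W_new = W - eta * grad_reg            # explicit gradient step on the regularizer
W_decayed = (1 - eta * lam) * W       # the equivalent "decay" form

print(np.allclose(W_new, W_decayed))  # True: each step shrinks W by (1 - ηλ)
```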

W is multiplied by a positive scalar smaller than one at every step, i.e. the weights "decay".

…64. List, explicitly, two different approaches that would allow this. Your new network should support varying image sizes at run-time, without having to be re-trained.

Solution:
• Fully convolutional: 1×1 conv or removing fc1; using an FCN or U-Net (1p)
• Preprocessing: resize, crop, or otherwise pre-process all input images to 64×64 (1p)
• ADAPTIVE pooling (average, max, etc.) with a fixed output size before fc1 (1p)
• Pooling with a fixed output size without explicitly mentioning "adaptive" (0.5p)
• If the approaches are too similar (e.g. 1×1 conv + FCN): 1p in total
• Not the intended answer, but OK: Graph Neural Network, RNN or LSTM, Transformer
• Not accepted: adding a conv at the beginning, since it cannot produce a fixed-size output for a variable input size
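The adaptive-pooling approach can be sketched without any framework; the function below mimics the binning idea of torch.nn.AdaptiveAvgPool2d (the function name and the sizes are illustrative, not from the exam):

```python
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a (H, W) feature map down to a fixed (out_h, out_w) size,
    whatever the input resolution."""
    h, w = x.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # bin boundaries scale with the input size (floor start, ceil end)
            r0, r1 = (i * h) // out_h, ((i + 1) * h + out_h - 1) // out_h
            c0, c1 = (j * w) // out_w, ((j + 1) * w + out_w - 1) // out_w
            out[i, j] = x[r0:r1, c0:c1].mean()
    return out

# any input resolution yields the same output shape, so the fc layer
# always receives a fixed-size vector
for size in [(64, 64), (100, 80), (37, 53)]:
    pooled = adaptive_avg_pool2d(np.random.rand(*size), 4, 4)
    print(pooled.shape)   # (4, 4) every time
```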


Problem 4: Optimization (13 credits)

4.1 Explain the idea behind the RMSProp optimizer. How does it enable faster convergence than standard SGD? How does it make use of the gradient? How does it use the gradient?

Solution:
• Idea (1p): RMSProp divides the learning rate by the root of an exponentially-decaying average of squared gradients, i.e. it adaptively scales the per-parameter step. The formula alone is not enough, since an explanation is asked for.
• Faster convergence than standard SGD (1p): a bigger effective learning rate where possible; dampening the oscillations in high-variance directions.
• How it uses the gradient (1p): it accumulates an exponentially-decaying average of squared gradients (the second moment); the formula is also accepted.
Note: RMSProp does not have momentum!
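The described scaling can be sketched in a few lines; the hyperparameter values below are common defaults, not prescribed by the exam:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: keep an exponentially-decaying average of squared
    gradients and divide the step per parameter by its square root."""
    cache = beta * cache + (1 - beta) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, 1.0])
cache = np.zeros_like(w)
# one direction with huge gradients, one with tiny ones: RMSProp rescales
# both to roughly the same effective step size, dampening oscillations
grad = np.array([100.0, 0.01])
w, cache = rmsprop_step(w, grad, cache)
print(1.0 - w)   # both parameters moved by ≈ lr / sqrt(1 - beta)
```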

4.2 What is the bias correction in the ADAM optimizer? Explain the problem that it fixes.

Solution: When accumulating gradients in a weighted-average fashion, the accumulator is initialized to zero. This biases all the accumulated gradients down towards zero. The bias correction normalizes the magnitude of the accumulated gradient for early steps.
• Bias correction (1p): normalizing/scaling the magnitude of the accumulated gradient for early steps; a formula containing the division by 1 − β^(k+1) also counts.
• Problem it fixes (1p): we initialize m₀ and v₀ with 0, so in the first few steps the moving average is heavily biased towards the initial zero value of m₀.
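The effect of the correction is easy to see with a constant gradient (a sketch; β₁ = 0.9 is the usual default):

```python
beta1 = 0.9
m = 0.0              # first moment, initialized to zero
grad = 1.0           # pretend the true gradient is constantly 1

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * grad
    m_hat = m / (1 - beta1**t)        # bias correction
    print(t, round(m, 4), round(m_hat, 4))
# the raw average m starts at 0.1, heavily biased toward the zero init,
# while the corrected m_hat recovers the true gradient 1.0 at every step
```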


4.3 You read that when training deeper networks, you may suffer from the vanishing gradients problem. Explain what vanishing gradients are in the context of deep convolutional networks and the underlying cause of the problem.

Solution: Vanishing gradients are gradients with a very small magnitude, causing a meaningless update step (1p). During back-propagation, the gradients become smaller and smaller from layer to layer, which leads to poor updates (1p). (Answering "0 gradients": −0.5p. Missing back-propagation when saying they become smaller and smaller: −0.5p.)
Cause of the problem in the context of deep conv nets (2p; any one reason earns full credit):
• Multiplying many gradients with magnitude < 1 through the chain rule over too many layers leads to vanishing gradients → 2p
• The network is deep and we multiply many small weights → 2p
• Saturated activation functions (only "activation function" is not enough, since not all of them saturate: 1p only) → 2p
• Poor initialization of the network / weights too small or too large (needs an explanation of how) → 2p
• "The output / the input / the data is small" → 1p (does not directly relate to the deep network)
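A toy illustration of the chain-rule cause (sigmoid is chosen here because its derivative is at most 0.25; the depth of 50 is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0
grad = 1.0
for layer in range(50):
    s = sigmoid(x)
    grad *= s * (1 - s)   # local derivative; equals 0.25 at x = 0
    # (weights omitted for simplicity; with moderate weights the
    #  product still shrinks geometrically)

print(grad)   # 0.25**50 ≈ 8e-31: practically no update for early layers
```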


4.4 In the following image you can see a segment of a very deep architecture that uses residual connections. How are residual connections helpful against vanishing gradients? Demonstrate this mathematically by performing a weight update for w₀. Make sure to explain how this reduces the effect of vanishing gradients.
Hint: Write the mathematical expression for ∂z/∂w₀ w.r.t. all other weights.

Grading:
• 1p brief explanation
• 2p chain rule (if not complete, still give 1p for correct evaluation of the derivatives in the chain rule)
• 1p evaluating the chain rule to show the "+1" term
• 1p final explanation
Additional: 2p for the final explanation if it is good and the brief explanation is omitted; −1p if the full chain rule down to ∂z/∂w₀ is not written; −1p for small errors in arithmetic.
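The exam's figure is not reproduced in this extraction, so the sketch below uses an assumed toy architecture with scalar weights; it only illustrates how the skip connection's "+1" keeps the chain-rule product from collapsing:

```python
# assumed toy network: h = w0 * x, then two residual blocks
# z1 = h + w1 * h and z2 = z1 + w2 * z1 (all scalars, weights deliberately tiny)
w0, w1, w2 = 0.01, 0.01, 0.01
x = 1.0

h = w0 * x
z1 = h + w1 * h
z2 = z1 + w2 * z1

# chain rule for dz2/dw0: each residual block contributes a (1 + w) factor,
# the "+1" coming from the skip connection
grad_w0 = (1 + w2) * (1 + w1) * x
print(grad_w0)        # ≈ 1.02: stays near 1 even with tiny weights

# the same network without skips: z = w2 * w1 * w0 * x, so dz/dw0 = w2 * w1 * x
print(w2 * w1 * x)    # ≈ 0.0001: the gradient has already vanished
```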


Problem 5: Multi-Class Classification (18 credits)

Note: If you cannot solve a sub-question and need its answer for a calculation in following sub-questions, mark it as such and use a symbolic placeholder (i.e., the mathematical expression you could not explicitly calculate, plus a note that it is missing from the previous question).

Assume you are given a labeled dataset {X, y}, where each sample x_i belongs to one of C = 10 classes. We denote its corresponding label y_i ∈ {1, ..., 10}. In addition, you can assume each data sample is a row vector. You are asked to train a classifier for this classification task, namely a 2-layer fully-connected network. For a visualization of the setting, refer to the following illustration:

5.1 Why does one use a Softmax activation at the end of such a classification network? What property does it have that makes it a common choice for a classification task?

Solution: It normalizes the logits/scores to be positive and sum up to 1, making the output of the network a valid probability distribution (2p). (It allows thinking of a probability of belonging to a class / allows using CE.)
0p: "its derivative can be expressed in terms of the softmax function itself" / "generalized sigmoid" (how does this help multi-class classification?) / properties not related to multi-class classification (e.g., numerical stability).
1p: saying only that the outputs are between [0, 1].

5.2 For a vector of logits z, the Softmax function σ: R^C → R^C is defined:

ŷ_i = σ(z)_i = e^{z_i} / Σ_{j=1}^{C} e^{z_j}

where C is the number of classes and z_i is the i-th logit. A special property of this function is that its derivative can be expressed in terms of the Softmax function itself. How could this be advantageous for training neural networks?

Solution: In the forward pass, the values of the softmax function are already evaluated and can be cached/saved for a more efficient backward pass in training, by simply plugging them into the gradient formula (1p). (0.5p: "efficient backward pass" without saying how or why.)
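The normalization property from 5.1 in a few lines (the max-shift is a standard numerical-stability trick, not required by the definition; the logits are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

y_hat = softmax(np.array([2.0, -1.0, 0.5, 3.0]))
print(y_hat.sum())              # ≈ 1.0: a valid probability distribution
print(bool((y_hat > 0).all()))  # True: every score becomes a positive probability
```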


5.3 Show explicitly how this can be done, by writing ∂ŷ_i/∂z_i in terms of ŷ_i.

Solution: Using the quotient rule:

∂ŷ_i/∂z_i = ∂/∂z_i ( e^{z_i} / Σ_j e^{z_j} )
          = ( e^{z_i} · Σ_j e^{z_j} − e^{z_i} · e^{z_i} ) / ( Σ_j e^{z_j} )²
          = ( e^{z_i} / Σ_j e^{z_j} ) · ( Σ_j e^{z_j} − e^{z_i} ) / Σ_j e^{z_j}
          = ŷ_i · (1 − ŷ_i)

Grading: 1p quotient rule and simplification; 2p substituting ŷ_i back to get the final answer; −1p for not using ŷ_i (e.g., using σ); −0.5p for a small math error when almost correct.

5.4 Similarly, show explicitly how this can be done, by writing ∂ŷ_i/∂z_j in terms of ŷ_i and ŷ_j, for i ≠ j.

Solution:

∂ŷ_i/∂z_j = ∂/∂z_j ( e^{z_i} / Σ_k e^{z_k} )
          = ( 0 · Σ_k e^{z_k} − e^{z_j} · e^{z_i} ) / ( Σ_k e^{z_k} )²
          = − ( e^{z_i} / Σ_k e^{z_k} ) · ( e^{z_j} / Σ_k e^{z_k} )
          = −ŷ_i · ŷ_j

Grading: 1p quotient rule and simplification; 1p substituting ŷ_i, ŷ_j back to get the final answer.
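Both derivative cases can be verified against central finite differences; a sketch with arbitrary logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0])
y = softmax(z)

d_ii = y[0] * (1 - y[0])   # 5.3: diagonal entry, ∂ŷ_0/∂z_0
d_ij = -y[0] * y[1]        # 5.4: off-diagonal entry, ∂ŷ_0/∂z_1

eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
e1 = np.array([0.0, eps, 0.0])
num_ii = (softmax(z + e0)[0] - softmax(z - e0)[0]) / (2 * eps)
num_ij = (softmax(z + e1)[0] - softmax(z - e1)[0]) / (2 * eps)

print(abs(d_ii - num_ii) < 1e-8, abs(d_ij - num_ij) < 1e-8)   # True True
```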


5.5 Using the Softmax activation, what loss function L(y, ŷ) would you want to minimize to train a network on such a multi-class classification task? Name this loss function (1p), and write down its formula (2p), for a single sample x, in terms of the network's prediction ŷ and its true label y. Here, you can assume the label y ∈ {0, 1}^C is a one-hot encoded vector:

y_i = 1 if i == true class index, 0 otherwise

Solution: Cross Entropy loss ("softmax loss" as a colloquial term); not Binary Cross Entropy (0p for binary)!

CE(y, ŷ) = − Σ_{j=1}^{C} y_j log ŷ_j

or, with j the true class index:

CE(y, ŷ) = − log ŷ_j

Grading: name 1p + formula 2p; forgetting the minus sign loses 0.5p (we minimize this loss); normalizing by 1/C is OK; a formula with another sum over all data is OK; the second version is OK, since the labels are one-hot vectors here.

5.6 Having done a forward pass with our sample x, we will back-propagate through the network. We want to perform a gradient update for the weight w²_{j,k} (the weight which is in row j, column k of the second weights' matrix W²). First, use the chain rule to write down the derivative ∂L/∂w_{j,k} as a product of 3 partial derivatives (no need to compute them). For convenience, you can ignore the bias and omit the ² superscript.

Solution: First, we write the chain rule:

∂L/∂w_{j,k} = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w_{j,k}
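The equivalence of the two CE formulations for one-hot labels, checked numerically (the logits are chosen arbitrarily):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y_hat = softmax(np.array([1.0, 2.0, 0.1]))
y = np.array([0.0, 1.0, 0.0])          # one-hot label, true class index 1

ce_full = -np.sum(y * np.log(y_hat))   # -Σ_j y_j log ŷ_j
ce_short = -np.log(y_hat[1])           # -log ŷ_j for the true class j

print(np.isclose(ce_full, ce_short))   # True: identical for one-hot labels
```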


5.7 Now, compute the gradient for the weight w_{3,1}. For this, you will need to compute each of the partial derivatives you have written above, and perform the multiplication to get the final answer. You can assume the ground-truth label for the sample was true_class = 3.
Hint: The derivative of the logarithm is (log t)′ = 1/t.

Solution: For the CE loss, the loss only depends on the prediction of ŷ_true, that is ŷ_3 in this case:

∂L/∂ŷ_3 = ∂(−log ŷ_3)/∂ŷ_3 = −1/ŷ_3   (1p)

ŷ_3 is affected by all of the entries of the vector z, because of the softmax. Note that w_{3,1} only affects z_1 (z = h · W), and from the previous sub-questions:

∂ŷ_3/∂z_1 = −ŷ_3 · ŷ_1   (1p)

We are only missing ∂z_1/∂w_{3,1}. That comes from the matrix multiplication:

z_1 = Σ_{k=1}^{H} h_k · w_{k,1}, so ∂z_1/∂w_{3,1} = h_3   (1p)

Finally, combining everything yields:

∂L/∂w_{3,1} = (−1/ŷ_3) · (−ŷ_3 · ŷ_1) · h_3 = ŷ_1 · h_3   (2p)

(wrong sign: lose 0.5p)
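The final result ŷ₁ · h₃ can be sanity-checked against a numerical gradient (toy sizes; the code uses 0-based indices, so true_class = 3 becomes index 2 and w_{3,1} becomes W[2, 0]):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=4)         # hidden activations (toy size H = 4)
W = rng.normal(size=(4, 3))    # second weight matrix, z = h @ W
c = 2                          # true_class = 3 -> index 2

def loss(W):
    return -np.log(softmax(h @ W)[c])

y_hat = softmax(h @ W)
analytic = y_hat[0] * h[2]     # ŷ_1 * h_3 in 0-based indexing

eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[2, 0] += eps                # perturb w_{3,1}
Wm[2, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)   # True: matches ŷ_1 * h_3
```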



Additional space for solutions: clearly mark the (sub)problem your answers are related to and strike out invalid solutions.


