
Title: Exercise Solution 04 (Linear Regression)
Author: Hans Hsu
Course: Machine Learning
Institution: Technische Universität München

Machine Learning — WS2019 — Module IN2064

Machine Learning Exercise Sheet 04

Linear Regression

Homework

Least squares regression

Problem 1: Let's assume we have a dataset where each datapoint $(x_i, y_i)$ is weighted by a scalar factor which we will call $t_i$. We will assume that $t_i > 0$ for all $i$. This makes the sum-of-squares error function look like the following:

$$E_{\text{weighted}}(w) = \frac{1}{2} \sum_{i=1}^{N} t_i \left( w^T \phi(x_i) - y_i \right)^2$$

Find the equation for the value of $w$ that minimizes this error function. Furthermore, explain how this weighting factor $t_i$ can be interpreted in terms of 1) the variance of the noise on the data and 2) data points for which there are exact copies in the dataset.

If we define $T = \operatorname{diag}(t_1, \ldots, t_N)$ to be a diagonal matrix containing the weighting coefficients, then we can write the weighted sum-of-squares cost function in the form

$$E_{\text{weighted}}(w) = \frac{1}{2} (\Phi w - y)^T T (\Phi w - y)$$

Setting the derivative with respect to $w$ to zero and rearranging then gives

$$w^*_{\text{weighted}} = (\Phi^T T \Phi)^{-1} \Phi^T T y,$$

which reduces to the standard solution for the case $T = I$, i.e.

$$w_{\text{ML}} = (\Phi^T \Phi)^{-1} \Phi^T y$$

If you remember back to when we modeled the likelihood using a Gaussian, our likelihood had the following form:

$$p(y \mid \Phi, w, \beta) = \prod_{i=1}^{N} \mathcal{N}\left(y_i \mid w^T \phi(x_i), \beta^{-1}\right)$$


After applying the logarithm and using the standard form for the univariate Gaussian, our equation looked like this:

$$\ln p(y \mid \Phi, w, \beta) = \sum_{i=1}^{N} \ln \mathcal{N}\left(y_i \mid w^T \phi(x_i), \beta^{-1}\right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_{\text{LS}}(w)$$

where $E_{\text{LS}}(w)$ is the standard sum-of-squares error function (not to be confused with the $E_{\text{weighted}}(w)$ we defined earlier). Remember that $E_{\text{LS}}(w)$ is defined as follows:

$$E_{\text{LS}}(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T \phi(x_i) - y_i \right)^2$$

When we compare $E_{\text{LS}}$ with $E_{\text{weighted}}$ and consider the effect of swapping the two in the previous likelihood equation, we can see that $t_i$ can be regarded as a precision (inverse variance) parameter, particular to the data point $(x_i, y_i)$, that either replaces or scales $\beta$. Alternatively, $t_i$ can be regarded as an effective number of replicated observations of the data point $(x_i, y_i)$; this becomes particularly clear if we consider $E_{\text{weighted}}(w)$ with $t_i$ taking positive integer values, although it is valid for any $t_i > 0$.
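The closed-form solution and the replication interpretation are easy to check numerically. The following NumPy sketch is not part of the original sheet; the simple $[1, x]$ feature map and the toy data are illustrative assumptions.

```python
import numpy as np

def weighted_least_squares(Phi, y, t):
    """Minimize E_weighted(w) = 0.5 * sum_i t_i (w^T phi(x_i) - y_i)^2.

    Phi: (N, M) design matrix, y: (N,) targets, t: (N,) positive weights.
    Returns w* = (Phi^T T Phi)^{-1} Phi^T T y via a linear solve.
    """
    T = np.diag(t)
    A = Phi.T @ T @ Phi            # Phi^T T Phi
    b = Phi.T @ T @ y              # Phi^T T y
    return np.linalg.solve(A, b)   # more stable than forming the inverse

# Toy example: duplicating a point is equivalent to giving it weight 2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
Phi = np.stack([np.ones_like(x), x], axis=1)     # features [1, x]
y = 2.0 * x - 1.0 + 0.1 * rng.normal(size=20)

w_dup = weighted_least_squares(np.vstack([Phi, Phi[:1]]),
                               np.append(y, y[0]), np.ones(21))
w_wgt = weighted_least_squares(Phi, y, np.array([2.0] + [1.0] * 19))
print(np.allclose(w_dup, w_wgt))   # True: weight t_i acts like t_i copies
```

Here, keeping a single copy of the first data point with weight $t_1 = 2$ gives exactly the same solution as duplicating it with unit weights, which is the "replicated observations" reading of $t_i$.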

Ridge regression

Problem 2: Show that the following holds: the ridge regression estimates can be obtained by ordinary least squares regression on an augmented dataset. Augment the design matrix $\Phi \in \mathbb{R}^{N \times M}$ with $M$ additional rows $\sqrt{\lambda}\, I_{M \times M}$ and augment $y$ with $M$ zeros.

Ordinary least squares minimizes $(y - \Phi w)^T (y - \Phi w)$. For ridge regression we need to minimize $(y - \Phi w)^T (y - \Phi w) + \lambda w^T w$. If we define

$$\hat{\Phi} = \begin{pmatrix} \Phi \\ \sqrt{\lambda}\, I \end{pmatrix} \quad \text{and} \quad \hat{y} = \begin{pmatrix} y \\ 0_M \end{pmatrix},$$

we can formulate the ridge regression objective as minimizing $(\hat{y} - \hat{\Phi} w)^T (\hat{y} - \hat{\Phi} w)$. Indeed, expanding this expression gives $(y - \Phi w)^T (y - \Phi w) + (\sqrt{\lambda} w)^T (\sqrt{\lambda} w) = (y - \Phi w)^T (y - \Phi w) + \lambda w^T w$, which is exactly the ridge objective.
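As a quick numerical sanity check (not part of the original sheet), the sketch below compares the ridge closed-form solution against ordinary least squares on the augmented dataset; the random data and the choice $\lambda = 0.3$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 50, 5, 0.3
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Ridge closed form: (Phi^T Phi + lam I)^{-1} Phi^T y
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Ordinary least squares on the augmented dataset
Phi_hat = np.vstack([Phi, np.sqrt(lam) * np.eye(M)])  # append sqrt(lam) * I rows
y_hat = np.concatenate([y, np.zeros(M)])              # append M zero targets
w_ols, *_ = np.linalg.lstsq(Phi_hat, y_hat, rcond=None)

print(np.allclose(w_ridge, w_ols))  # True: both minimize the same objective
```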

Problem 3: Derive the closed-form solution for the ridge regression error function

$$E_{\text{ridge}}(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T \phi(x_i) - y_i \right)^2 + \frac{\lambda}{2} w^T w$$

Additionally, discuss the scenario when the number of training samples $N$ is smaller than the number of basis functions $M$. What computational issues arise in this case? How does regularization address them?

Upload a single PDF file with your homework solution to Moodle by 10.11.2019, 23:59 CET. We recommend to typeset your solution (using LATEX or Word), but handwritten solutions are also accepted. If your handwritten solution is illegible, it won’t be graded and you waive your right to dispute that.

Sheet 04 · Page 3

Machine Learning — WS2019 — Module IN2064

$$E_{\text{ridge}}(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T \phi(x_i) - y_i \right)^2 + \frac{\lambda}{2} w^T w = \frac{1}{2} (\Phi w - y)^T (\Phi w - y) + \frac{\lambda}{2} w^T w$$

Taking the gradient:

$$\nabla_w E_{\text{ridge}}(w) = \Phi^T \Phi w - \Phi^T y + \lambda w = (\Phi^T \Phi + \lambda I) w - \Phi^T y$$

Setting it to zero:

$$(\Phi^T \Phi + \lambda I) w = \Phi^T y \quad \Rightarrow \quad w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$$

If $N < M$, the matrix $\Phi^T \Phi \in \mathbb{R}^{M \times M}$ is singular and therefore not invertible (this may happen even if $N \geq M$, e.g. when some features are correlated). When regularization is used, $\lambda I$ is added to $\Phi^T \Phi$; since $\Phi^T \Phi$ is positive semi-definite, $\Phi^T \Phi + \lambda I$ is positive definite for any $\lambda > 0$ and hence invertible, which fixes the potential degeneracy and makes the problem tractable.
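The $N < M$ situation is easy to reproduce. The following sketch (not from the sheet; dimensions and $\lambda$ are illustrative) shows that the Gram matrix is rank-deficient, while the regularized system remains solvable.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 10, 20, 1e-2            # fewer samples than basis functions
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

gram = Phi.T @ Phi                  # (M, M), rank at most N < M
print(np.linalg.matrix_rank(gram))  # 10 -> singular, a plain inverse would fail

# With regularization the system is positive definite and well posed.
w_ridge = np.linalg.solve(gram + lam * np.eye(M), Phi.T @ y)
print(w_ridge.shape)                # (20,)
```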

Upload a single PDF file with your homework solution to Moodle by 10.11.2019, 23:59 CET. We recommend to typeset your solution (using LATEX or Word), but handwritten solutions are also accepted. If your handwritten solution is illegible, it won’t be graded and you waive your right to dispute that.

Sheet 04 · Page 4

Machine Learning — WS2019 — Module IN2064

Multi-output linear regression

Problem 4: In class, we only considered functions of the form $f: \mathbb{R}^n \to \mathbb{R}$. What about the general case of $f: \mathbb{R}^n \to \mathbb{R}^m$? For linear regression with multiple outputs, write down the log-likelihood formulation and derive the MLE of the parameters. The observation $y_i$ is a vector with $y_i \sim \mathcal{N}(W^T x_i, \Sigma)$, where $W \in \mathbb{R}^{n \times m}$, $\Sigma \in \mathbb{R}^{m \times m}$, and the covariance $\Sigma$ is known and fixed for all possible observations.

For $N$ i.i.d. observed pairs $(x_i, y_i)$, the likelihood is

$$p(Y \mid W, \Sigma) = \prod_i \mathcal{N}(y_i \mid W^T x_i, \Sigma),$$

and thus the negative log-likelihood is

$$-\ln p(Y \mid W, \Sigma) = \text{const} + \frac{1}{2} \sum_i (y_i - W^T x_i)^T \Sigma^{-1} (y_i - W^T x_i)$$

Since $\Sigma$ is symmetric and positive definite, we can use the Cholesky decomposition $\Sigma = L L^T$ to obtain $\Sigma^{-1} = L^{-T} L^{-1}$. With this we can simplify the two factors via

$$L^{-1}(y_i - W^T x_i) = L^{-1} y_i - L^{-1} W^T x_i = \hat{y}_i - \hat{W}^T x_i, \quad \text{where } \hat{y}_i = L^{-1} y_i \text{ and } \hat{W} = W L^{-T}.$$

Using this transformation we get

$$
\begin{aligned}
-\ln p(Y \mid W, \Sigma) &= \text{const} + \frac{1}{2} \sum_i (\hat{y}_i - \hat{W}^T x_i)^T (\hat{y}_i - \hat{W}^T x_i) \\
&= \text{const} + \frac{1}{2} \sum_i \sum_j (\hat{y}_i - \hat{W}^T x_i)_j \, (\hat{y}_i - \hat{W}^T x_i)_j \\
&= \text{const} + \frac{1}{2} \sum_i \sum_j \big(\hat{Y}_{ij} - (X \hat{W})_{ij}\big) \big(\hat{Y}_{ij} - (X \hat{W})_{ij}\big) \\
&= \text{const} + \frac{1}{2} \sum_i \sum_j \big((\hat{Y} - X \hat{W})^T\big)_{ji} \big(\hat{Y} - X \hat{W}\big)_{ij} \\
&= \text{const} + \frac{1}{2} \sum_j \big((\hat{Y} - X \hat{W})^T (\hat{Y} - X \hat{W})\big)_{jj} \\
&= \text{const} + \frac{1}{2} \operatorname{Tr}\!\left[ (\hat{Y} - X \hat{W})^T (\hat{Y} - X \hat{W}) \right],
\end{aligned}
$$

where $\hat{Y}$ and $X$ are matrices that have the vectors $\hat{y}_i$ and $x_i$ as their rows. Note that the trace essentially just denotes the sum over all dimensions of the output domain. Taking the derivative of the negative log-likelihood with respect to $\hat{W}$ and setting it to zero gives $\hat{W}_{\text{MLE}} = (X^T X)^{-1} X^T \hat{Y}$. So these are $m$ single least-squares problems, one for every column of $\hat{Y}$. Finally, transforming $\hat{W}_{\text{MLE}}$ back gives $W_{\text{MLE}} = \hat{W}_{\text{MLE}} L^T$.
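The whiten-solve-unwhiten recipe above can be checked directly in NumPy. The sketch below is not part of the sheet; the random data and the construction of a positive-definite $\Sigma$ are illustrative assumptions. The final comparison reflects the observation that, with $\Sigma$ known and shared, the MLE reduces to independent least-squares problems, one per output column.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, m = 200, 4, 3                              # samples, input dim, output dim
X = rng.normal(size=(N, n))                      # rows are the inputs x_i
W_true = rng.normal(size=(n, m))

A = rng.normal(size=(m, m))
Sigma = A @ A.T + 0.1 * np.eye(m)                # known positive-definite covariance
L = np.linalg.cholesky(Sigma)                    # Sigma = L L^T
Y = X @ W_true + rng.normal(size=(N, m)) @ L.T   # rows y_i ~ N(W^T x_i, Sigma)

# Whiten the outputs: y_hat_i = L^{-1} y_i, i.e. Y_hat = Y L^{-T}
Y_hat = np.linalg.solve(L, Y.T).T
# m independent least-squares problems: W_hat_MLE = (X^T X)^{-1} X^T Y_hat
W_hat, *_ = np.linalg.lstsq(X, Y_hat, rcond=None)
# Transform back: W_MLE = W_hat_MLE L^T
W_mle = W_hat @ L.T

# Because Sigma is known and shared, this coincides with least squares on Y itself
W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W_mle, W_ls))                  # True
```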

Upload a single PDF file with your homework solution to Moodle by 10.11.2019, 23:59 CET. We recommend to typeset your solution (using LATEX or Word), but handwritten solutions are also accepted. If your handwritten solution is illegible, it won’t be graded and you waive your right to dispute that.

Sheet 04 · Page 5

Machine Learning — WS2019 — Module IN2064

Comparison of Linear Regression Models

Problem 5: We want to perform regression on a dataset consisting of $N$ samples $x_i \in \mathbb{R}^D$ with corresponding targets $y_i \in \mathbb{R}$ (represented compactly as $X \in \mathbb{R}^{N \times D}$ and $y \in \mathbb{R}^N$). Assume that we have fitted an $L_2$-regularized linear regression model and obtained the optimal weight vector $w^* \in \mathbb{R}^D$ as

$$w^* = \arg\min_w \; \frac{1}{2} \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \frac{\lambda}{2} w^T w$$

Note that there is no bias term. Now assume that we obtained a new data matrix $X_{\text{new}}$ by scaling all samples by the same positive factor $a \in (0, \infty)$. That is, $X_{\text{new}} = aX$ (and respectively $x_i^{\text{new}} = a x_i$).

a) Find the weight vector $w_{\text{new}}$ that will produce the same predictions on $X_{\text{new}}$ as $w^*$ produces on $X$.

Predictions of a linear regression model are generated as $\hat{y} = w^T x$. This means that we need to ensure that $w^{*T} x_i = w_{\text{new}}^T x_i^{\text{new}}$, or equivalently $w^{*T} x_i = w_{\text{new}}^T (a x_i)$. Solving for $w_{\text{new}}$ we get $w_{\text{new}} = \frac{w^*}{a}$.

b) Find the regularization factor $\lambda_{\text{new}} \in \mathbb{R}$ such that the solution $w^*_{\text{new}}$ of the new $L_2$-regularized linear regression problem

$$w^*_{\text{new}} = \arg\min_w \; \frac{1}{2} \sum_{i=1}^{N} (w^T x_i^{\text{new}} - y_i)^2 + \frac{\lambda_{\text{new}}}{2} w^T w$$

will produce the same predictions on $X_{\text{new}}$ as $w^*$ produces on $X$. Provide a mathematical justification for your answer.

The closed-form solution for $w^*$ on the original data $X$ is

$$w^* = (X^T X + \lambda I)^{-1} X^T y$$

The closed-form solution for $w^*_{\text{new}}$ on the new data $X_{\text{new}}$ is

$$w^*_{\text{new}} = (X_{\text{new}}^T X_{\text{new}} + \lambda_{\text{new}} I)^{-1} X_{\text{new}}^T y = a (a^2 X^T X + \lambda_{\text{new}} I)^{-1} X^T y$$

By setting $\lambda_{\text{new}} = a^2 \lambda$, we get

$$w^*_{\text{new}} = a (a^2 X^T X + a^2 \lambda I)^{-1} X^T y = \frac{1}{a} (X^T X + \lambda I)^{-1} X^T y = \frac{1}{a} w^*,$$

Upload a single PDF file with your homework solution to Moodle by 10.11.2019, 23:59 CET. We recommend to typeset your solution (using LATEX or Word), but handwritten solutions are also accepted. If your handwritten solution is illegible, it won’t be graded and you waive your right to dispute that.

Sheet 04 · Page 6

Machine Learning — WS2019 — Module IN2064

which (according to our answer in part (a) of this problem) will produce the same predictions on $X_{\text{new}}$ as $w^*$ does on $X$, as desired.

Equivalent solution:

$$
\begin{aligned}
w^*_{\text{new}} \overset{!}{=} \frac{w^*}{a}
&= \frac{1}{a} \arg\min_w \; \frac{1}{2} \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \frac{\lambda}{2} w^T w \\
&= \frac{1}{a} \arg\min_w \; \frac{1}{2} \sum_{i=1}^{N} \left( \frac{w^T}{a} \, a x_i - y_i \right)^2 + \frac{a^2 \lambda}{2} \, \frac{w^T}{a} \frac{w}{a} \\
&= \frac{1}{a} \, a \arg\min_{w_{\text{new}}} \; \frac{1}{2} \sum_{i=1}^{N} (w_{\text{new}}^T x_i^{\text{new}} - y_i)^2 + \frac{a^2 \lambda}{2} w_{\text{new}}^T w_{\text{new}} \qquad \left( w_{\text{new}} = \tfrac{w}{a} \right) \\
&\overset{!}{=} w^*_{\text{new}} = \arg\min_{w_{\text{new}}} \; \frac{1}{2} \sum_{i=1}^{N} (w_{\text{new}}^T x_i^{\text{new}} - y_i)^2 + \frac{\lambda_{\text{new}}}{2} w_{\text{new}}^T w_{\text{new}}
\end{aligned}
$$

For this equality to hold, we need to match the regularization term by setting $\lambda_{\text{new}} = a^2 \lambda$.
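Both parts can be verified numerically. The sketch below is not part of the sheet; the random data, the scale $a = 3$, and $\lambda = 0.5$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, lam, a = 100, 6, 0.5, 3.0
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

def ridge(X, y, lam):
    # Closed form: (X^T X + lam I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

w_star = ridge(X, y, lam)
X_new = a * X

# Part a): w_new = w*/a reproduces the original predictions on the scaled data
print(np.allclose(X_new @ (w_star / a), X @ w_star))       # True

# Part b): refitting on the scaled data with lambda_new = a^2 * lambda
w_new_star = ridge(X_new, y, a**2 * lam)
print(np.allclose(w_new_star, w_star / a))                 # True
print(np.allclose(X_new @ w_new_star, X @ w_star))         # True
```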

Programming Task: Least squares regression

Problem 6: Load the notebook 04_homework_linear_regression.ipynb from Piazza. Fill in the missing code and run the notebook. Convert the evaluated notebook to PDF and add it to the printout of your homework.

Note: We suggest that you use Anaconda for installing Python and Jupyter, as well as for managing packages. We recommend that you use Python 3. For more information on Jupyter notebooks and how to convert them to other formats, consult the Jupyter documentation and nbconvert documentation.


In-class Exercises

Problem 7: Assume that we are given a dataset where each sample $x_i$ and regression target $y_i$ is generated according to the following process:

$$x_i \sim \text{Uniform}(-10, 10), \qquad y_i = a x_i^3 + b x_i^2 + c x_i + d + \epsilon_i,$$

where $\epsilon_i \sim \mathcal{N}(0, 1)$ and $a, b, c, d \in \mathbb{R}$.

The 3 regression algorithms below are applied to the given data. Your task is to say whether the bias and variance of these models are low or high. Provide a 1-2 sentence explanation for each of your answers.

a) Linear regression

Bias: high. Variance: low. A straight line cannot capture a degree-3 polynomial (thus underfitting the data).

b) Polynomial regression with degree 3

Bias: low. Variance: low. The model matches the data-generating process, so we can achieve a good fit.

c) Polynomial regression with degree 10

Bias: low. Variance: high. Since we are using polynomial regression with a degree much higher than that of the data-generating process, the model will overfit the data.
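These claims can be illustrated empirically. The simulation sketch below is not from the sheet; the coefficients $a, b, c, d$, the dataset size of 30 points, and the number of repetitions are arbitrary assumptions. It refits each model on many freshly sampled datasets and estimates the squared bias and variance of its predictions on a test grid.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)
a, b, c, d = 0.5, -1.0, 2.0, 3.0          # illustrative true coefficients

def sample_dataset(n=30):
    x = rng.uniform(-10, 10, size=n)
    y = a * x**3 + b * x**2 + c * x + d + rng.normal(size=n)
    return x, y

x_test = np.linspace(-10, 10, 50)
y_true = a * x_test**3 + b * x_test**2 + c * x_test + d

# Refit each model on many independently drawn datasets; the spread of its
# predictions estimates the variance, their systematic error the squared bias.
for degree in (1, 3, 10):
    preds = []
    for _ in range(200):
        x, y = sample_dataset()
        model = Polynomial.fit(x, y, deg=degree)
        preds.append(model(x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - y_true) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: bias^2 ~ {bias2:10.2f}, variance ~ {var:8.2f}")
```

Degree 1 should show a large squared bias, degree 3 should be low in both, and degree 10 should show a noticeably larger variance than degree 3.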

Problem 8: Given is a training set consisting of samples $X = \{x_1, x_2, \ldots, x_N\}$ with respective regression targets $y = \{y_1, y_2, \ldots, y_N\}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$. Alice fits a linear regression model $f(x_i) = w^T x_i$ to the dataset using the closed-form solution for linear regression (normal equations). Bob has heard that by transforming the inputs $x_i$ with a vector-valued function $\phi$, he can fit an alternative function, $g(x_i) = v^T \phi(x_i)$, using the same procedure (solving the normal equations). He decides to use a linear transformation $\phi(x_i) = A^T x_i$, where $A \in \mathbb{R}^{D \times D}$ has full rank.

a) Show that Bob's procedure will fit the same function as Alice's original procedure, that is, $f(x) = g(x)$ for all $x \in \mathbb{R}^D$ (given that $w$ and $v$ minimize the training set error).

By fitting $g(x) = v^T \phi(x)$, Bob can only set the function equal to linear combinations of the inputs: $(v^T A^T) x = w^T x$, where $w = A v$. Moreover, just like Alice, all linear combinations are available: any function Alice fits can be matched by setting $v = A^{-1} w$.


As both Alice and Bob are selecting the function that best matches the outputs from the same set of functions, and with the same cost function, they will select the same function.

b) Can Bob's procedure lead to a lower training set error than Alice's if the matrix $A$ is not invertible? Explain your answer.

Since $A$ is square and not invertible, multiple input vectors are transformed to the same output vector. It is not possible for Bob to assign different function values to two such inputs, whereas Alice can; Bob can no longer fit all the same functions as Alice. Furthermore, Alice can still match any of Bob's solutions by setting $w = A v$. Since the error function is a convex quadratic, all of its minimizers attain the same optimal value, so Alice's optimal training error is either equal to Bob's or lower. In summary, Bob's training error might be worse than Alice's, but it cannot be better. Note that we are only talking about training error in this example, not test error. Bob might manage to find a model that generalizes better than Alice's, but Alice will always be able to fit the training data at least as well as Bob.
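Part (a) is easy to check numerically. The sketch below is not part of the sheet; the random data and dimensions are illustrative, and the normal-equation solver assumes the design matrix has full column rank (which holds here with probability 1).

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 40, 5
X = rng.normal(size=(N, D))                # rows are the samples x_i
y = rng.normal(size=N)

def fit_normal_equations(Phi, y):
    # w = (Phi^T Phi)^{-1} Phi^T y, assuming Phi has full column rank
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

A = rng.normal(size=(D, D))                # full rank with probability 1
w = fit_normal_equations(X, y)             # Alice: f(x) = w^T x
v = fit_normal_equations(X @ A, y)         # Bob:   g(x) = v^T (A^T x)

x_query = rng.normal(size=D)
print(np.allclose(w @ x_query, v @ (A.T @ x_query)))   # True: f(x) = g(x)
print(np.allclose(w, A @ v))                           # True: w = A v
```

Note that the rows of Bob's design matrix are $\phi(x_i)^T = x_i^T A$, which is why it is computed as `X @ A`.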

Problem 9: See Jupyter notebook inclass_04_notebook.ipynb.

Upload a single PDF file with your homework solution to Moodle by 10.11.2019, 23:59 CET. We recommend typesetting your solution (using LaTeX or Word), but handwritten solutions are also accepted. If your handwritten solution is illegible, it won't be graded and you waive your right to dispute that.

