
Machine Learning — WS2020 — Module IN2064

Machine Learning Exercise Sheet 04

Linear Regression

Exercise sheets consist of two parts: homework and in-class exercises. You solve the homework exercises on your own or with your registered group and upload them to Moodle for a possible grade bonus. The in-class exercises will be solved and explained during the tutorial. You do not have to upload solutions to the in-class exercises.

In-class Exercises

Problem 1: Assume that we are given a dataset where each sample $x_i$ and regression target $y_i$ is generated according to the following process:
$$x_i \sim \text{Uniform}(-10, 10),$$
$$y_i = a x_i^3 + b x_i^2 + c x_i + d + \epsilon_i,$$
where $\epsilon_i \sim \mathcal{N}(0, 1)$ and $a, b, c, d \in \mathbb{R}$.

The 3 regression algorithms below are applied to the given data. Your task is to say what the bias and variance of these models are (low or high). Provide a 1-2 sentence explanation for each of your answers.

a) Linear regression
Bias: high. Variance: low. A straight line cannot capture a degree-3 polynomial (thus underfitting the data).

b) Polynomial regression with degree 3
Bias: low. Variance: low. The model is the same as the data-generating process, so we can achieve a good fit.

c) Polynomial regression with degree 10
Bias: low. Variance: high. Since we are using a polynomial degree much higher than that of the data-generating process, the model will overfit the data.

Problem 2: Given is a training set consisting of samples $X = (x_1, x_2, \ldots, x_N)^T$ with respective regression targets $y = (y_1, y_2, \ldots, y_N)^T$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$. Alice fits a linear regression model $f(x_i) = w^T x_i$ to the dataset using the closed-form solution for linear regression (normal equations).


Bob has heard that by transforming the inputs $x_i$ with a vector-valued function $\Phi$, he can fit an alternative function, $g(x_i) = v^T \Phi(x_i)$, using the same procedure (solving the normal equations). He decides to use a linear transformation $\Phi(x_i) = A^T x_i$, where $A \in \mathbb{R}^{D \times D}$ has full rank.

a) Show that Bob's procedure will fit the same function as Alice's original procedure, that is, $f(x) = g(x)$ for all $x \in \mathbb{R}^D$ (given that $w$ and $v$ minimize the training set error).

Alice uses the normal equation directly and obtains
$$w^* = (X^T X)^{-1} X^T y.$$

Bob fits the model to the transformed data and obtains
$$v^* = \big((XA)^T (XA)\big)^{-1} (XA)^T y = \big(A^T X^T X A\big)^{-1} A^T X^T y \overset{(\dagger)}{=} A^{-1} (X^T X)^{-1} A^{-T} A^T X^T y \overset{(\ddagger)}{=} A^{-1} (X^T X)^{-1} X^T y = A^{-1} w^*,$$
where $(\dagger)$ and $(\ddagger)$ use that $A$ is invertible.

(Note that $\Phi$ transforms the column vectors $x_i$ via $A^T x_i$, but $X$ contains the transposed observations $x_i^T$ as rows. Therefore the transformed feature matrix is $XA$.) Now it is immediate to see that
$$g(x_i) = v^{*T} \Phi(x_i) = w^{*T} A^{-T} A^T x_i = w^{*T} x_i = f(x_i).$$
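As a quick numerical sanity check (not part of the original sheet; the synthetic data, dimensions, and the random matrix A below are made up for illustration), a minimal NumPy sketch fitting both models and comparing their predictions:

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3

# Synthetic data: X stores the observations x_i^T as rows.
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# A random square Gaussian matrix is invertible with probability 1, so it serves as a full-rank A.
A = rng.normal(size=(D, D))

# Alice: w* = (X^T X)^{-1} X^T y (computed via least squares for numerical stability).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bob: the same procedure on the transformed feature matrix XA.
v_star, *_ = np.linalg.lstsq(X @ A, y, rcond=None)

# Identical predictions on new inputs, and v* = A^{-1} w* as derived above.
x_new = rng.normal(size=(5, D))
print(np.allclose(x_new @ w_star, (x_new @ A) @ v_star))   # True
print(np.allclose(v_star, np.linalg.solve(A, w_star)))     # True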

b) Can Bob's procedure lead to a lower training set error than Alice's if the matrix $A$ is not invertible? Explain your answer.

Any weights $v^*$ Bob finds are also feasible for Alice by letting $w = A v^*$. Therefore Bob can only access a subset of the parameter space and cannot achieve a lower loss value than Alice. (It could still be equal, but it cannot be better.) Note that we are only talking about training error in this example, not test error. Bob might manage to find a model that generalizes better than Alice's, but Alice will always be able to fit the training data at least as well as Bob.
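To see part b) concretely, here is a small sketch with synthetic data (again invented for illustration, not part of the original solution) in which $A$ is rank-deficient; Alice's training error is never larger than Bob's:

import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Make A rank-deficient by zeroing out its last column.
A = rng.normal(size=(D, D))
A[:, -1] = 0.0

w_alice, *_ = np.linalg.lstsq(X, y, rcond=None)
v_bob, *_ = np.linalg.lstsq(X @ A, y, rcond=None)

err_alice = np.sum((X @ w_alice - y) ** 2)      # Alice's training error
err_bob = np.sum((X @ A @ v_bob - y) ** 2)      # Bob's training error
print(err_alice <= err_bob + 1e-9)              # True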

Problem 3:

See Jupyter notebook inclass_04_notebook.ipynb.

Homework

Least squares regression

Problem 4: Let's assume we have a dataset where each data point $(x_i, y_i)$ is weighted by a scalar factor which we will call $t_i$. We will assume that $t_i > 0$ for all $i$. This makes the sum-of-squares error function


look like the following:
$$E_{\text{weighted}}(w) = \frac{1}{2} \sum_{i=1}^{N} t_i \left( w^T \phi(x_i) - y_i \right)^2$$

Find the equation for the value of $w$ that minimizes this error function. Furthermore, explain how this weighting factor $t_i$ can be interpreted in terms of 1) the variance of the noise on the data and 2) data points for which there are exact copies in the dataset.

If we define $T = \operatorname{diag}(t_1, \ldots, t_N)$ to be a diagonal matrix with $t$ on the diagonal, we can write the weighted sum-of-squares cost function in the form
$$E_{\text{weighted}}(w) = \frac{1}{2} (\Phi w - y)^T T (\Phi w - y).$$
Now we follow the same steps that we used to derive the optimal solution for ordinary least squares: setting the gradient $\nabla_w E_{\text{weighted}}(w) = \Phi^T T (\Phi w - y)$ to zero and solving for $w$, we arrive at
$$w^*_{\text{weighted}} = (\Phi^T T \Phi)^{-1} \Phi^T T y.$$

Can we understand weighted linear regression in a probabilistic context, as we did with ordinary least squares? There we had modeled the regression targets as i.i.d. random variables
$$y_i \sim \mathcal{N}\left( w^T \phi(x_i), \beta^{-1} \right)$$

with a common noise precision of $\beta$. From this we derived the form of the maximum likelihood error (negative log-likelihood) as
$$E_{\text{ML}}(w, \beta) = \beta \underbrace{\frac{1}{2} \sum_{i=1}^{N} \left( w^T \phi(x_i) - y_i \right)^2}_{E_{\text{LS}}} \underbrace{{} - \frac{N}{2} \ln \beta + \frac{N}{2} \ln 2\pi}_{\text{constant w.r.t. } w}.$$

But the least-squares error part just stems from the definition of the normal distribution, via the term marked with $(\dagger)$ below:
$$\mathcal{N}\left( y_i \mid w^T \phi(x_i), \beta^{-1} \right) \propto \exp\Big( \overbrace{-\frac{\beta}{2} \left( w^T \phi(x_i) - y_i \right)^2}^{(\dagger)} \Big).$$

From this we deduce that weighted least squares is equivalent to probabilistic least squares where we choose a per-sample noise precision $\beta_i = t_i$, in effect modeling $y_i$ as
$$y_i \sim \mathcal{N}\left( w^T \phi(x_i), t_i^{-1} \right),$$
making the regression targets no longer identically distributed, but still independent.

For $t_i \in \mathbb{N}$, $t_i$ can be regarded as an effective number of replicated observations of the data point $(x_i, y_i)$.
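To illustrate both the closed form and the replication interpretation, here is a small NumPy sketch (the feature matrix, targets, and integer weights are invented for illustration; they are not part of the original solution):

import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4

Phi = rng.normal(size=(N, M))            # rows are the feature vectors phi(x_i)^T
y = rng.normal(size=N)
t = rng.integers(1, 4, size=N)           # integer weights t_i in {1, 2, 3}

# Weighted least squares: w* = (Phi^T T Phi)^{-1} Phi^T T y.
T = np.diag(t.astype(float))
w_weighted = np.linalg.solve(Phi.T @ T @ Phi, Phi.T @ T @ y)

# Replication interpretation: repeat each data point t_i times and run ordinary least squares.
Phi_rep = np.repeat(Phi, t, axis=0)
y_rep = np.repeat(y, t)
w_replicated, *_ = np.linalg.lstsq(Phi_rep, y_rep, rcond=None)

print(np.allclose(w_weighted, w_replicated))   # True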


Ridge regression

Problem 5: Show that ridge regression on a design matrix $\Phi \in \mathbb{R}^{N \times M}$ with regularization strength $\lambda$ is equivalent to ordinary least squares regression with an augmented design matrix and target vector
$$\hat{\Phi} = \begin{pmatrix} \Phi \\ \sqrt{\lambda}\, I_M \end{pmatrix} \quad \text{and} \quad \hat{y} = \begin{pmatrix} y \\ 0_M \end{pmatrix}.$$

Ordinary least squares minimizes $\frac{1}{2} (\Phi w - y)^T (\Phi w - y)$. Plugging in the augmented design matrix and regression target, we get
$$\frac{1}{2} (\hat{\Phi} w - \hat{y})^T (\hat{\Phi} w - \hat{y}) = \frac{1}{2} (\Phi w - y)^T (\Phi w - y) + \frac{1}{2} \big( \sqrt{\lambda}\, I_M w \big)^T \big( \sqrt{\lambda}\, I_M w \big) = \frac{1}{2} \sum_{i=1}^{N} \left( \Phi_i w - y_i \right)^2 + \frac{\lambda}{2} \| w \|_2^2,$$

which is equal to the ridge regression loss function.
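As a quick check of this equivalence (synthetic data and an arbitrary $\lambda$, chosen only for illustration and not part of the original solution), a minimal NumPy sketch comparing the ridge closed-form solution with ordinary least squares on the augmented problem:

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 50, 4, 0.7

Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Ridge regression closed form: w* = (Phi^T Phi + lambda I)^{-1} Phi^T y.
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Ordinary least squares on the augmented design matrix and target vector.
Phi_hat = np.vstack([Phi, np.sqrt(lam) * np.eye(M)])
y_hat = np.concatenate([y, np.zeros(M)])
w_augmented, *_ = np.linalg.lstsq(Phi_hat, y_hat, rcond=None)

print(np.allclose(w_ridge, w_augmented))   # True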

Implementation

Problem 6: John Doe is a data scientist, and he wants to fit a polynomial regression model to his data. For this, he needs to choose the degree of the polynomial that works best for his problem. Unfortunately, John hasn't attended IN2064, so he writes the following code for choosing the optimal degree of the polynomial:

X, y = load_data()
best_error = -1
best_degree = None
for degree in range(1, 50):
    w = fit_polynomial_regression(X, y, degree)
    y_predicted = predict_polynomial_regression(X, w, degree)
    error = compute_mean_squared_error(y, y_predicted)
    if (error...

