EECS 127–Fall 2020 Homework 8 Zehao Huang

1. Formulating problems as LPs or QPs

(a) Is the robust problem (3) convex?

Proof. Yes it is! The objective function is $f_0(x) = \max_{u:\|u\|_\infty \le \lambda} \|v + u - x\mathbf{1}\|_2^2$. Observe that $\|v + u - x\mathbf{1}\|_2^2$ is a convex function of $x$ for each fixed $u$. As a result, $f_0$ is the pointwise maximum of a family of convex functions. Thus $f_0$ is convex and our problem is also convex. ✷

(b) Show that problem (3) can be expressed as another form.

Proof.
$$\max_{u:\|u\|_\infty \le \lambda} \|v + u - x\mathbf{1}\|_2^2 = \sum_{i=1}^n \max_{|u_i| \le \lambda} (v_i + u_i - x)^2.$$
Since $x$ and $v_i$ are constants inside each inner maximization, we only need to maximize $|v_i + u_i - x|$:
$$u_i^* = \begin{cases} \lambda, & v_i - x \ge 0 \\ -\lambda, & v_i - x < 0 \end{cases}
\qquad\Longrightarrow\qquad
v_i + u_i^* - x = \begin{cases} \lambda + |v_i - x|, & v_i - x \ge 0 \\ -\lambda - |v_i - x|, & v_i - x < 0 \end{cases}$$
so that
$$\max_{|u_i| \le \lambda} (v_i + u_i - x)^2 = (|v_i - x| + \lambda)^2.$$
As a result, replacing the inner max inside (3), we get
$$\min_x \max_{u:\|u\|_\infty \le \lambda} \|v + u - x\mathbf{1}\|_2^2 = \min_x \sum_{i=1}^n (|v_i - x| + \lambda)^2. \qquad ✷$$
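As a quick numerical sanity check of this identity, one can compare a brute-force maximization over the corners of the $\ell_\infty$ ball (where the maximum of a convex function over a box is attained) with the closed form; a minimal sketch with made-up values of $v$, $x$, and $\lambda$:

import itertools
import numpy as np

rng = np.random.default_rng(0)
n, lam, x = 5, 0.7, 0.3                      # small made-up instance
v = rng.normal(size=n)

# the maximum of a convex function over a box is attained at a corner u_i = +/- lambda
corners = itertools.product([-lam, lam], repeat=n)
brute = max(np.sum((v + np.array(u) - x) ** 2) for u in corners)

closed_form = np.sum((np.abs(v - x) + lam) ** 2)
print(brute, closed_form)                    # the two values should agree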

(c) Express problem (2) as an LP. State precisely the variables and the constraints, if any.

With variables $x \in \mathbb{R}$ and $t \in \mathbb{R}^n$:
$$\begin{aligned} \min_{x,\,t} \quad & \mathbf{1}^T t \\ \text{subject to} \quad & v - x\mathbf{1} - t \le 0 \\ & x\mathbf{1} - v - t \le 0. \end{aligned}$$
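A minimal sketch of solving this LP numerically with scipy.optimize.linprog (assuming scipy is available; the vector $v$ below is made up, and the decision vector stacks $x$ and $t$):

import numpy as np
from scipy.optimize import linprog

v = np.array([1.0, 3.0, 2.0, 8.0, 5.0])      # made-up data
n = len(v)

# decision vector z = [x, t_1, ..., t_n]; the objective is 1^T t
c = np.concatenate(([0.0], np.ones(n)))

# v - x*1 - t <= 0   ->   -x - t_i <= -v_i
# x*1 - v - t <= 0   ->    x - t_i <=  v_i
A_ub = np.zeros((2 * n, n + 1))
A_ub[:n, 0] = -1.0
A_ub[:n, 1:] = -np.eye(n)
A_ub[n:, 0] = 1.0
A_ub[n:, 1:] = -np.eye(n)
b_ub = np.concatenate((-v, v))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
print(res.x[0], np.median(v))                # the optimal x should be a median of v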

(d) Express problem (3) as a QP. State precisely the variables and the constraints, if any.

Since $\sum_{i=1}^n (t_i + \lambda)^2 = t^T t + 2\lambda\mathbf{1}^T t + n\lambda^2$ and the objective is increasing in each $t_i \ge 0$, at the optimum $t_i = |v_i - x|$. With variables $x \in \mathbb{R}$ and $t \in \mathbb{R}^n$:
$$\begin{aligned} \min_{x,\,t} \quad & t^T I t + 2\lambda\mathbf{1}^T t + n\lambda^2 \\ \text{subject to} \quad & v - x\mathbf{1} - t \le 0 \\ & x\mathbf{1} - v - t \le 0. \end{aligned}$$
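The same QP written directly in cvxpy, as a minimal sketch (assuming cvxpy is available; $v$ and $\lambda$ below are made-up values):

import cvxpy as cp
import numpy as np

v = np.array([1.0, 3.0, 2.0, 8.0, 5.0])      # made-up data
n, lam = len(v), 0.5

x = cp.Variable()
t = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(t) + 2 * lam * cp.sum(t) + n * lam**2)
constraints = [v - x * np.ones(n) - t <= 0,
               x * np.ones(n) - v - t <= 0]
cp.Problem(objective, constraints).solve()
print(x.value)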


(e) Show that when $\lambda$ is large the solution set of the problem in (3) approaches that of the median problem (2).

Proof. From the QP formulation of (3), as $\lambda$ grows large the terms $2\lambda\mathbf{1}^T t + n\lambda^2$ dominate the term $t^T I t$. The problem then essentially minimizes $2\lambda\mathbf{1}^T t + n\lambda^2$, which has the same minimizers as $\mathbf{1}^T t$ because $\lambda$ is a constant. Thus the solution set approaches that of (2). ✷

(f) Based on the previous part of this question, justify this statement.

Proof. Problem (3) is a robust version of the mean problem: it minimizes the worst-case squared error under bounded perturbations of the data. As $\lambda$ grows, this robust problem approaches the median problem (2). In this sense the median is a more robust notion of the "middle" of a set of real numbers than the mean. ✷
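A small numerical illustration of parts (e) and (f), minimizing $\sum_i (|v_i - x| + \lambda)^2$ over the scalar $x$ for increasing $\lambda$ and comparing with the median and the mean (a minimal sketch; the data, chosen with an outlier, are made up):

import numpy as np
from scipy.optimize import minimize_scalar

v = np.array([1.0, 2.0, 3.0, 5.0, 40.0])     # made-up data with an outlier

def robust_obj(x, lam):
    return np.sum((np.abs(v - x) + lam) ** 2)

for lam in [0.0, 1.0, 10.0, 100.0]:
    res = minimize_scalar(lambda x: robust_obj(x, lam),
                          bounds=(v.min(), v.max()), method="bounded")
    print(lam, res.x)                        # the minimizer moves from the mean toward the median
print("median:", np.median(v), "mean:", np.mean(v))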


Comparing Classification Methods

In this problem, we compare three methods for classification: Ridge Regression, which we've seen over and over again in this course; a soft-margin Support Vector Machine, which was the de facto method for classification for a large chunk of machine learning's history; and LASSO (Least Absolute Shrinkage and Selection Operator), created by our colleagues over at Stanford. Robert Tibshirani (author of the paper that introduced LASSO) does acknowledge and thank Leo Breiman (very famous Berkeley professor who unfortunately passed away in 2005) for sharing his garrote paper with him before publication.

The problem we will try to tackle is a very simple binary classification problem in $\mathbb{R}^2$. The training data consists of a variety of coordinates in $\mathbb{R}^2$, some of which have been assigned to class $C_0$ and others to class $C_1$. Your goal for each classification method is the following: given a coordinate, determine whether this point belongs to $C_0$ or $C_1$. As we shall see, even though all three methods attempt to solve the same problem, the way we formulate the problem actually makes a difference, i.e. these methods will most likely not perform equally!

Imports/Utils/Loading Data


In [2]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt

        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC
        from sklearn.linear_model import Ridge
        from sklearn.linear_model import Lasso
        from sklearn.metrics import accuracy_score
        from plot_boundary import plot_boundary
        from numpy.polynomial.polynomial import polyval
        %matplotlib inline
        %config InlineBackend.figure_format = 'svg'
        from sklearn.preprocessing import PolynomialFeatures

        # load training data
        train_data = np.load("data/train.npy")
        X_train = train_data[:, 1:]
        y_train = train_data[:, 0]

        # load test data
        test_data = np.load("data/test.npy")
        X_test = test_data[:, 1:]
        y_test = test_data[:, 0]

Visualizing the data

To make the problem clearer, let's plot the training data and see what it looks like; luckily, being two-dimensional coordinates, this can be nicely done. In the following plot the orange data points correspond to $C_1$ (class = 1) and the blue data points correspond to $C_0$ (class = 0). Something to reflect on: Is the data linearly separable? Why would we care?


In [3]: plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], alpha=0.4)
        plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], alpha=0.4)
        plt.title("Training Data")
        plt.show()

Competitor N.1: Ridge Regression

Ridge regression is a regression technique that is quite similar to regular least squares linear regression: simply adding an $\ell_2$ penalty on the parameters $w$ to the objective function for linear regression yields the objective function for ridge regression. Our goal is to solve the following problem:

$$\min_{w \in \mathbb{R}^n} \|Xw - Y\|_2^2 + \lambda \|w\|_2^2,$$

where $\lambda$ is a hyperparameter, $X$ is the training data and $Y$ the observations vector. In practice, we tune $\lambda$ until we find a model that generalizes well to the test data. There is no algorithm to find the best $\lambda$; usually we check various values and see which one gives the best results on the test data. Ridge regression is an example of a shrinkage method: compared to least squares, it shrinks the parameter estimates in the hope of reducing variance, improving prediction accuracy, and aiding interpretation. Intuitively, we see right away that $w$'s with large norms will probably not work even if our residual error is 0.
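Without an intercept, this objective has the closed-form minimizer $w^* = (X^T X + \lambda I)^{-1} X^T Y$; a minimal sketch comparing it against sklearn's Ridge with fit_intercept=False (the random data below are made up):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # made-up design matrix
Y = rng.normal(size=50)                      # made-up targets
lam = 0.5

# closed-form ridge solution (no intercept term)
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)

# sklearn minimizes the same objective when fit_intercept=False
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, Y).coef_
print(np.allclose(w_closed, w_sklearn))      # expected: True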


a) Fill out the following code block to run ridge regression to classify the data.

In [27]: fitted_model = None  # your trained model (as trained by sklearn)
         y_pred_sign = None   # the prediction of your trained model on the testing set

         # convert the labels from 0 and 1 to -1 and 1
         y_train_sign = np.array(y_train)
         y_test_sign = np.array(y_test)
         y_train_sign[y_train_sign == 0] = -1
         y_test_sign[y_test_sign == 0] = -1

         # for the regularization parameter lambda, try choosing values on different orders of magnitude
         llambda = 0.5

         ########## Your beautiful code starts here ##########
         # TODO: train a fitted_model and run prediction to generate y_pred_sign
         # Hint: look up the sign function to get a prediction consistent with our labels
         model = Ridge(alpha=llambda)
         model.fit(X_train, y_train_sign)
         y_pred_sign = np.sign(model.predict(X_test))
         fitted_model = model
         ########## Your beautiful code ends here ##########

         accuracy = accuracy_score(y_pred_sign, y_test_sign)
         print("Test Accuracy: {}".format(accuracy))
         plot_boundary(X_test, y_test, fitted_model)

         Test Accuracy: 0.855


Competitor N.2: Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO is somewhat similar to ridge regression: if you compare the formulations of the two problems, you will notice that the only difference is in the penalty function. In ridge regression we shrink the coefficients, since multiplying by $\lambda$ makes it unfavorable to pick $w$'s with large coefficients. LASSO, as the name suggests, not only shrinks the coefficients but also "selects" some of them to be 0. In a hand-wavy sense, it attempts to understand which features (i.e. elements) of $w$ don't really help us in the classification problem. Using LASSO, our goal is to solve the following problem:

$$\min_{w \in \mathbb{R}^n} \|Xw - Y\|_2^2 + \lambda \|w\|_1,$$

where $\lambda$ is a hyperparameter, $X$ is the training data and $Y$ the observations vector. In practice, we tune $\lambda$ until we find a model that generalizes well to the test data. There is no algorithm to find the best $\lambda$; usually we check various values and see which one gives the best results on the test data.
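To see the "selection" behavior concretely, here is a minimal sketch fitting Lasso and Ridge over a few regularization strengths on made-up synthetic data (note that sklearn's Lasso scales the data-fit term slightly differently from the formulation above, which does not change the qualitative behavior):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # made-up features
w_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])      # only two informative coordinates
Y = X @ w_true + 0.1 * rng.normal(size=100)

for lam in [0.01, 0.1, 1.0]:
    lasso_coef = Lasso(alpha=lam).fit(X, Y).coef_
    ridge_coef = Ridge(alpha=lam).fit(X, Y).coef_
    # Lasso sets the uninformative coordinates exactly to 0 as lam grows; Ridge only shrinks them
    print(lam, np.round(lasso_coef, 2), np.round(ridge_coef, 2))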

b) Fill out the following code block to run LASSO to classify the data.

In [28]: fitted_model_lasso = None  # your trained model (as trained by sklearn)
         y_pred_sign_lasso = None   # the prediction of your trained model on the testing set

         # convert the labels from 0 and 1 to -1 and 1
         y_train_sign = np.array(y_train)
         y_test_sign = np.array(y_test)
         y_train_sign[y_train_sign == 0] = -1
         y_test_sign[y_test_sign == 0] = -1

         # for the regularization parameter lambda, try choosing values of different orders of magnitude
         llambda = 0.5

         ########## Your beautiful code starts here ##########
         # TODO: train a fitted_model and run prediction to generate y_pred_sign_lasso
         model = Lasso(alpha=llambda)
         model.fit(X_train, y_train_sign)
         y_pred_sign_lasso = np.sign(model.predict(X_test))
         fitted_model_lasso = model
         ########## Your beautiful code ends here ##########

         accuracy = accuracy_score(y_pred_sign_lasso, y_test_sign)
         print("Test Accuracy: {}".format(accuracy))
         plot_boundary(X_test, y_test, fitted_model_lasso)

         Test Accuracy: 0.875


Competitor N.3: Soft-Margin Support Vector Machine

In lecture and Q4 of HW8 you have been familiarized with what is known as the Hard-Margin SVM. In this problem we will be using the soft-margin SVM, which is defined as follows:

$$\begin{aligned} \min_{w \in \mathbb{R}^m,\; b \in \mathbb{R},\; \zeta \in \mathbb{R}^n} \quad & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \zeta_i \\ \text{s.t.} \quad & 1 - \zeta_i - y_i(x_i^\top w - b) \le 0 \\ & \zeta_i \ge 0 \end{aligned}$$

where $x_i$ is the $i$th data point, $y_i \in \{-1, 1\}$ is the label, and $C$ is a hyperparameter that controls how "soft" the margin is.

Notice that we simply added some slack terms $\zeta_i$. Why did we do this? As you may know, Hard-Margin SVMs require the data to be linearly separable; otherwise you wouldn't have any feasible points. Soft-Margin SVMs allow us to find a decision boundary even if the data is not linearly separable, as is the case in this exercise. By adding slack terms we allow some points to violate the margin; however, the more you violate the margin the "worse" you will do, since you will be adding positive terms multiplied by some positive constant $C$ to the objective, i.e. $C\sum_{i=1}^n \zeta_i$ will drive the objective value up as more points violate the margin. If $C \to \infty$ then we get back our Hard-Margin SVM; however, it would fail miserably in this case since the data is not linearly separable. Why is this? We see that if $C$ is very large then it becomes really expensive to violate the margin, and since we are minimizing, this is the exact opposite of what we want. Here is a good resource if you would like to understand Soft-Margin SVMs better: https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec6.pdf

Notice that this is more complicated than the other two regression problems; when picking a classification method, the complexity of the model is also something you might want to take into account.
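A minimal sketch of the effect of $C$, using sklearn's SVC as in the next cell (it reuses the arrays loaded above and assumes the labels have already been converted to $\pm 1$ as in the earlier cells):

for C in [0.01, 1.0, 100.0]:
    svm = SVC(C=C, gamma="auto")
    svm.fit(X_train, y_train_sign)
    train_acc = accuracy_score(svm.predict(X_train), y_train_sign)
    test_acc = accuracy_score(svm.predict(X_test), y_test_sign)
    # small C tolerates margin violations; large C approaches hard-margin behavior
    print(C, train_acc, test_acc)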

c) Fill out the following code block to train a soft-margin SVM and use it to classify the data.


In [33]: svc = None          # your trained model
         y_pred_sign = None  # the prediction of your trained model on the testing set

         ########## Your beautiful code starts here ##########
         # TODO: Write code to train an SVM, and generate prediction y_pred, choose C
         c = .5
         # The documentation for sklearn's SVM: https://scikit-learn.org/stable
         # Note your classifier should be called "svc" as given in the beginning
         model = SVC(C=c, gamma="auto")
         model.fit(X_train, y_train_sign)
         y_pred_sign = np.sign(model.predict(X_test))
         svc = model
         ########## Your beautiful code ends here ##########

         # note: y_pred_sign takes values in {-1, 1} while y_test is in {0, 1};
         # scoring against y_test_sign (as in the earlier cells) would be the consistent comparison
         accuracy = accuracy_score(y_pred_sign, y_test)
         print("Test Accuracy: {}".format(accuracy))
         plot_boundary(X_test, y_test, svc)

         Test Accuracy: 0.47



2. A minimum time path problem

We use $l_1, l_2, l_3$ as our variables. The objective is the total time we spend on the path, namely
$$f_0(l) = \frac{l_1}{v_1} + \frac{l_2}{v_2} + \frac{l_3}{v_3} = \frac{1}{v_1}\left(l_1 + 1.5\,l_2 + 1.2\,l_3\right) = \frac{1}{v_1}\, l^T \begin{bmatrix} 1 \\ 1.5 \\ 1.2 \end{bmatrix}.$$
The constraints are the following. The equalities can be relaxed to inequalities because, if an inequality were not tight at a solution, we could shrink that particular $l_i$ and make the objective even smaller:
$$\sqrt{p_1^2 + 1} \le l_1, \qquad \sqrt{(p_2 - p_1)^2 + 1} \le l_2, \qquad \sqrt{(4 - p_2)^2 + 1} \le l_3.$$
These can be rewritten as second-order cone constraints
$$\left\|\begin{bmatrix} p_1 \\ 1 \end{bmatrix}\right\|_2 \le l_1, \qquad \left\|\begin{bmatrix} p_2 - p_1 \\ 1 \end{bmatrix}\right\|_2 \le l_2, \qquad \left\|\begin{bmatrix} 4 - p_2 \\ 1 \end{bmatrix}\right\|_2 \le l_3.$$
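A minimal cvxpy sketch of this second-order cone formulation (assuming cvxpy is available; the constant factor $1/v_1$ is dropped since it does not affect the minimizer):

import cvxpy as cp

l = cp.Variable(3)
p = cp.Variable(2)                           # p[0] = p_1, p[1] = p_2

objective = cp.Minimize(l[0] + 1.5 * l[1] + 1.2 * l[2])
constraints = [
    cp.norm(cp.hstack([p[0], 1])) <= l[0],
    cp.norm(cp.hstack([p[1] - p[0], 1])) <= l[1],
    cp.norm(cp.hstack([4 - p[1], 1])) <= l[2],
]
cp.Problem(objective, constraints).solve()
print(p.value, l.value)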

3. LASSO vs Soft-Margin SVM vs Ridge Regression

(d) Comment on your results for each part. Which model performed the worst and why?

LASSO performed the best because LASSO builds on ridge regression and encourages sparse coefficients. SVM performed the worst because there is a lot of overlap in the data.

(e) Let $X_i$ denote the $i$th column of $X$.

i. First decompose the problem into $n$ univariate problems over each element of $w$:
$$\arg\min_{w \in \mathbb{R}^d} \|w\|_2^2 - 2 w^T X^T y + \|y\|_2^2 + \lambda \|w\|_1.$$
Thus we can decompose it coordinate-wise: for each $w_i$,
$$\arg\min_{w_i \in \mathbb{R}} w_i^2 + \lambda |w_i| - 2 w_i X_i^T y.$$

ii. What is $w_i^*$ and what is the condition on $y^T X_i$ for this to be possible?
$$w_i^* = \arg\min_{w_i \in \mathbb{R}} w_i^2 + (\lambda - 2 X_i^T y)\, w_i = X_i^T y - \frac{\lambda}{2}.$$
This holds when $X_i^T y \ge \frac{\lambda}{2}$.

iii. What is $w_i^*$ when it's less than 0?
$$w_i^* = \arg\min_{w_i \in \mathbb{R}} w_i^2 - (\lambda + 2 X_i^T y)\, w_i = X_i^T y + \frac{\lambda}{2}.$$
This holds when $X_i^T y \le -\frac{\lambda}{2}$.

iv. What happens when $|y^T X_i| \le \frac{\lambda}{2}$?

Proof. Then $w_i^* = 0$. Thus $\lambda$ determines how many entries of the optimal $w^*$ are set to 0. ✷
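Together, parts i–iv say $w_i^* = \operatorname{sign}(X_i^T y)\,\max(|X_i^T y| - \lambda/2,\ 0)$, i.e. soft-thresholding. A minimal numerical check (assuming cvxpy is available and, as in the derivation, a design matrix with orthonormal columns so that $\|Xw - y\|_2^2 = \|w\|_2^2 - 2w^T X^T y + \|y\|_2^2$; the data are made up):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.8
X, _ = np.linalg.qr(rng.normal(size=(n, d)))       # orthonormal columns: X.T @ X = I
y = rng.normal(size=n)

# closed form: soft-threshold X_i^T y at level lambda/2
z = X.T @ y
w_closed = np.sign(z) * np.maximum(np.abs(z) - lam / 2, 0.0)

# numerical solution of min ||Xw - y||_2^2 + lam * ||w||_1
w = cp.Variable(d)
cp.Problem(cp.Minimize(cp.sum_squares(X @ w - y) + lam * cp.norm1(w))).solve()
print(np.allclose(w_closed, w.value, atol=1e-4))   # expected: True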

4. Support vector machines

(a) Show that the problem (5) can be solved via the QP

Proof. Note that $\|w\|_2$ and $\frac{1}{2}\|w\|_2^2$ are related by a strictly increasing one-to-one map on $[0, \infty)$, so minimizing either one under the same set of constraints gives the same minimizers. Thus (5) can be solved via the QP
$$\begin{aligned} \min_{w,\,b} \quad & \frac{1}{2}\|w\|_2^2 \\ \text{subject to} \quad & y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, n. \end{aligned}$$ ✷

(b) Show that the Lagrangian dual to the primal QP in (7) can be written as the QP

Proof.
$$\begin{aligned} L(w, b, \lambda) &= \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^n \lambda_i\big(1 - y_i(w^T x_i - b)\big) \\ &= \sum_{i=1}^n \lambda_i + \frac{1}{2} w^T w - w^T \sum_{i=1}^n \lambda_i y_i x_i + b \sum_{i=1}^n \lambda_i y_i. \end{aligned}$$
Because $b \sum_{i=1}^n \lambda_i y_i$ can be made arbitrarily small unless $\sum_{i=1}^n \lambda_i y_i = 0$, we need to impose this as a constraint. Thus we have our dual objective
$$g(\lambda) = \inf_{w,b} L(w, b, \lambda) = \inf_{w,b}\; \sum_{i=1}^n \lambda_i + \frac{1}{2}\Big\|w - \sum_{i=1}^n \lambda_i y_i x_i\Big\|_2^2 - \frac{1}{2}\Big\|\sum_{i=1}^n \lambda_i y_i x_i\Big\|_2^2.$$
In this form it is not hard to see that $w^* = \sum_{i=1}^n \lambda_i y_i x_i$, and we then have
$$g(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2}\Big\|\sum_{i=1}^n \lambda_i y_i x_i\Big\|_2^2 = -\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j + \sum_{i=1}^n \lambda_i.$$
Maximizing $g(\lambda)$ over the dual-feasible $\lambda$ is therefore equivalent to the QP
$$\begin{aligned} \min_{\lambda} \quad & \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j y_i y_j x_i^T x_j - \sum_{i=1}^n \lambda_i \\ \text{subject to} \quad & \sum_{i=1}^n \lambda_i y_i = 0 \\ & \lambda_i \ge 0, \quad i = 1, \dots, n. \end{aligned}$$ ✷


(c) Show that strong duality holds.

Proof. To invoke Slater's condition, we just need to show that there exists a feasible $(w, b)$ that strictly satisfies the constraints, i.e. $y_i(w^T x_i + b) > 1$ for all $i$. Let
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix},$$
so that the constraints read $Xw + b\mathbf{1} = c \circ y$ (elementwise product) for some $c$ with $c_i > 1$. Since the $y_i$ are just in $\{-1, 1\}$, we can manipulate the $c_i$ so that $Xw + b\mathbf{1} = c \circ y$ holds. Thus Slater's condition holds and strong duality also holds. ✷

(d) Use the KKT conditions for the problem to express $w^*$, $b^*$.

Proof. By stationarity, $w^* = \sum_{j=1}^n \lambda_j^* y_j x_j$. Note that at least one point in each class must satisfy its constraint with equality; otherwise $w$ could be shrunk in norm so that all constraints are still satisfied but the objective is smaller. As a result, the points closest to the separating hyperplane lie on the margin, one for each class: for $y_i = 1$ we take the minimum of $w^{*T} x_i$, whereas for $y_i = -1$ we take the maximum. Finally we have
$$b^* = -\frac{1}{2}\left( \min_{i:\, y_i = 1} \sum_{j=1}^n \lambda_j^* y_j x_j^T x_i \;+\; \max_{i:\, y_i = -1} \sum_{j=1}^n \lambda_j^* y_j x_j^T x_i \right). \qquad ✷$$
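A minimal sklearn sketch illustrating these expressions (sklearn's SVC solves the soft-margin problem, so on linearly separable made-up data with a large $C$ it approximates the hard-margin solution; dual_coef_ stores $\lambda_j^* y_j$ for the support vectors):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated clusters (made-up data)
X = np.vstack([rng.normal(size=(20, 2)) + [3, 3],
               rng.normal(size=(20, 2)) - [3, 3]])
y = np.array([1] * 20 + [-1] * 20)

svc = SVC(C=1e6, kernel="linear").fit(X, y)

w_from_dual = svc.dual_coef_ @ svc.support_vectors_    # sum_j lambda_j* y_j x_j
print(np.allclose(w_from_dual, svc.coef_))             # expected: True
print(svc.intercept_)                                  # plays the role of b* (up to sign convention)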

5. Geometric programs

(a) Is the problem as stated convex? Why or why not?

Proof. No, it is not. Consider $\alpha = 1$, $\alpha_1 = \alpha_2 = 0.5$; the resulting objective function is actually concave. ✷

(b) Show that $y \mapsto \log f(e^y)$ is convex.

Proof.
$$\log f(e^y) = \log\big(\delta\, e^{y_1 d_1 + y_2 d_2 + \cdots + y_n d_n}\big) = \log \delta + \sum_{i=1}^n y_i d_i.$$
This is an affine function of $y$ and hence convex. ✷

(c) Show that $\mathrm{lse}$ is a convex function on $\mathbb{R}^n$.

Proof. Write $\mathrm{lse}(x) = \log \sum_{i=1}^n e^{x_i}$ and let $z = (e^{x_1}, \dots, e^{x_n})$. The Hessian is
$$\nabla^2 \mathrm{lse}(x) = \frac{1}{\mathbf{1}^T z}\,\mathrm{diag}(z) - \frac{1}{(\mathbf{1}^T z)^2}\, z z^T,$$
and for any $v \in \mathbb{R}^n$,
$$v^T \nabla^2 \mathrm{lse}(x)\, v = \frac{\big(\sum_i z_i v_i^2\big)\big(\sum_i z_i\big) - \big(\sum_i z_i v_i\big)^2}{\big(\sum_i z_i\big)^2} \ge 0$$
by the Cauchy–Schwarz inequality applied to the vectors $(\sqrt{z_i}\, v_i)_i$ and $(\sqrt{z_i})_i$. Hence $\nabla^2 \mathrm{lse}(x) \succeq 0$ and $\mathrm{lse}$ is convex. ✷

(d) Show that $y \mapsto \log f(e^y)$ is convex.

Proof. Note that
$$\log f(e^y) = \log\Big(\sum_{i=1}^n f_i(e^y)\Big) = \log\Big(\sum_{i=1}^n \exp\big(\log \delta_i + d_i^T y\big)\Big),$$
which is exactly a log-sum-exp of affine functions of $y$; by part (c) and composition with an affine map, this function is convex. ✷

(e) Based on the observations in the three preceding parts of this question, convert the geometric program in (12) to a convex problem in standard form.

Proof. Taking logarithms and substituting $x = e^y$, we get the standard form
$$\begin{aligned} \min_y \quad & \mathrm{lse}(A_0 y + b_0) \\ \text{subject to} \quad & \mathrm{lse}(A_i y + b_i) \le 0 \\ & R y + h = 0, \end{aligned}$$
where the equality constraints are affine because the logarithm of a monomial equality constraint is affine in $y$. ✷
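For concreteness, a minimal cvxpy sketch of this conversion on a tiny made-up geometric program (not the program (12) itself): minimize $x_1 + x_2$ subject to $4 x_1^{-1} x_2^{-1} \le 1$, with the change of variables $x = e^y$:

import cvxpy as cp
import numpy as np

# each monomial delta * x1^{d1} * x2^{d2} becomes exp(d^T y + log delta) under x = e^y
A0 = np.array([[1.0, 0.0], [0.0, 1.0]])      # objective terms x1 and x2
b0 = np.log([1.0, 1.0])
A1 = np.array([[-1.0, -1.0]])                # constraint term 4 / (x1 * x2)
b1 = np.log([4.0])

y = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.log_sum_exp(A0 @ y + b0)),
                  [cp.log_sum_exp(A1 @ y + b1) <= 0])
prob.solve()
print(np.exp(y.value))                       # expected approximately [2, 2]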

