
CS 189: Introduction to Machine Learning
Spring 2019, Jonathan Shewchuk

DIS2

1 Back to Basics: Linear Algebra

Let X ∈ R^{n×d}. We do not assume that X is full rank.

(a) Give the definition of the rowspace, columnspace, and nullspace of X.

Solution: The rowspace is the span of the rows of X, the columnspace is the span of the columns of X, and the nullspace is the set of vectors v such that Xv = 0.

(b) Check the following facts:

(a) The rowspace of X is the columnspace of X^T, and vice versa.

Solution: The rows of X are the columns of X^T, and vice versa.

(b) The nullspace of X and the rowspace of X are orthogonal complements.

Solution: v is in the nullspace of X if and only if Xv = 0, which is true if and only if ⟨X_i, v⟩ = 0 for every row X_i of X. This is precisely the condition that v is perpendicular to each row of X. This means that v is in the nullspace of X if and only if v is in the orthogonal complement of the span of the rows of X, i.e. the orthogonal complement of the rowspace of X.

(c) The nullspace of X^T X is the same as the nullspace of X. Hint: if v is in the nullspace of X^T X, then v^T X^T X v = 0.

Solution: If v is in the nullspace of X, then X^T X v = X^T 0 = 0. On the other hand, if v is in the nullspace of X^T X, then v^T X^T X v = v^T 0 = 0. But v^T X^T X v = ‖Xv‖² , so ‖Xv‖² = 0, which implies that Xv = 0.

(d) The columnspace and rowspace of X^T X are the same, and are equal to the rowspace of X. Hint: use the relationship between nullspace and rowspace.

Solution: X^T X is symmetric, and therefore its rows and columns are the same; hence its columnspace and rowspace are the same. By the previous part, the nullspace of X^T X is equal to the nullspace of X. Thus

    rowspace(X) = nullspace(X)^⊥ = nullspace(X^T X)^⊥ = rowspace(X^T X),

where (·)^⊥ denotes the orthogonal complement.
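These facts are easy to sanity-check numerically. The sketch below (a minimal check assuming numpy; not part of the original worksheet) builds a rank-deficient X and verifies facts (c) and (d) by comparing nullspaces and rowspace projectors obtained from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 8, 5, 3
# Build a rank-deficient X (rank r < d).
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
G = X.T @ X

def nullspace(M, tol=1e-10):
    # Orthonormal basis for the nullspace, from the rows of V^T past the numerical rank.
    _, s, Vt = np.linalg.svd(M)
    rank = np.sum(s > tol)
    return Vt[rank:].T

N_X, N_G = nullspace(X), nullspace(G)
# (c) nullspace(X^T X) = nullspace(X): same dimension, and X maps the former to 0.
assert N_X.shape[1] == N_G.shape[1] == d - r
assert np.allclose(X @ N_G, 0)

def row_proj(M, tol=1e-10):
    # Orthogonal projector onto the rowspace of M.
    _, s, Vt = np.linalg.svd(M)
    V = Vt[: np.sum(s > tol)].T
    return V @ V.T

# (d) rowspace(X^T X) = rowspace(X): the two projectors agree.
assert np.allclose(row_proj(X), row_proj(G))
```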

2 Concentration Inequalities

For a given random variable, we are often interested in computing bounds on its tail, or on the probability that it deviates from its mean. In this problem we will prove a concentration inequality for sub-exponential random variables. A random variable X with mean E[X] = µ is sub-exponential if there are non-negative parameters (ν, b) such that

    E[e^{λ(X−µ)}] ≤ e^{ν²λ²/2}   for all |λ| < 1/b.        (1)

We will prove that if X is sub-exponential,

    P(X ≥ µ + t) ≤ e^{−t²/(2ν²)}   if 0 ≤ t ≤ ν²/b,
    P(X ≥ µ + t) ≤ e^{−t/(2b)}     if t > ν²/b.            (2)

(a) We will prove the case µ = 0. First, use a Chernoff bound to show that

    P(X ≥ t) ≤ exp{−λt + λ²ν²/2}        (3)

for any λ ∈ [0, 1/b).

Solution: To use a Chernoff bound, we write

    P(X ≥ t) = P(e^{λX} ≥ e^{λt}) ≤ e^{−λt} E[e^{λX}].

Applying the definition of a sub-exponential variable, we have that

    e^{−λt} E[e^{λX}] ≤ exp{−λt + λ²ν²/2}.

(b) Define g(λ) = −λt + λ²ν²/2. Compute the optimal value of λ which minimizes g under the constraint that λ ∈ (0, 1/b].

Solution: We first solve the unconstrained version of the problem by finding the critical point of the quadratic. The derivative of g is g′(λ) = −t + λν². Setting it to 0, we solve for λ₁ = t/ν². If 0 ≤ t < ν²/b, then λ₁ < 1/b. In this case, g(λ₁) = −t²/(2ν²), which completes the first case of the proposition. If t ≥ ν²/b, then since g is monotonically decreasing on the constraint set, the minimum is achieved at the boundary, λ₂ = 1/b. Then g(λ₂) = −t/b + ν²/(2b²) ≤ −t/(2b), where the inequality follows from the fact that t ≥ ν²/b.

(c) Use these values to deduce the tightest possible bound for the inequality from the first part, completing the proof.

Solution: Simply plug in the values of λ₁ and λ₂ for the corresponding ranges of t from the previous part.
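To see the two-regime bound in action, here is a small Monte Carlo sketch (assuming numpy, and assuming the commonly cited fact, not derived here, that a chi-squared variable with one degree of freedom is sub-exponential with parameters (ν, b) = (2, 4)). It compares the empirical tail P(X ≥ µ + t) against the bound in (2).

```python
import numpy as np

rng = np.random.default_rng(0)
nu, b = 2.0, 4.0   # assumed sub-exponential parameters for X = Z^2 with Z ~ N(0, 1)
mu = 1.0           # E[Z^2] = 1
X = rng.standard_normal(1_000_000) ** 2

def tail_bound(t):
    # Two-regime bound from equation (2).
    return np.exp(-t**2 / (2 * nu**2)) if t <= nu**2 / b else np.exp(-t / (2 * b))

for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    empirical = np.mean(X >= mu + t)
    print(f"t = {t:4.1f}   empirical tail = {empirical:.2e}   bound = {tail_bound(t):.2e}")
```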


3 Vector Calculus

Below, x ∈ R^d means that x is a d × 1 (column) vector with real-valued entries. Likewise, A ∈ R^{d×d} means that A is a d × d matrix with real-valued entries. In this course, we will by convention consider vectors to be column vectors.

Consider x, w ∈ R^d and A ∈ R^{d×d}. In the following questions, ∂/∂x denotes the derivative with respect to x, while ∇_x denotes the gradient with respect to x. Compute the following:

Solution: A good resource for matrix calculus is the matrix cookbook, https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf, and the wikipedia page, https://en.wikipedia.org/wiki/Matrix_calculus.

Let us first understand the definition of the derivative. Let f : R^d → R denote a scalar function. Then the derivative ∂f/∂x is an operator that helps find the change in function value at x, up to first order, when we add a little perturbation ∆ to x. That is,

    f(x + ∆) = f(x) + (∂f/∂x) ∆ + o(‖∆‖),        (4)

where the term o(‖∆‖) stands for any term r(∆) such that r(∆)/‖∆‖ → 0 as ‖∆‖ → 0. An example of such a term is a quadratic term like ‖∆‖². Let us quickly verify that r(∆) = ‖∆‖² is indeed an o(‖∆‖) term. As ‖∆‖ → 0, we have

    r(∆)/‖∆‖ = ‖∆‖²/‖∆‖ = ‖∆‖ → 0,

thereby verifying our claim. As a rule of thumb, any term that has a higher-order dependence on ‖∆‖ than linear is o(‖∆‖) and is ignored when computing the derivative. (By contrast, r(∆) = √‖∆‖ is not an o(‖∆‖) term, since in that case r(∆)/‖∆‖ = 1/√‖∆‖ → ∞ as ‖∆‖ → 0.)

We call ∂f/∂x the derivative of f at x. Ideally, we should use df/dx, but it is okay to use ∂ to indicate that f may depend on some other variable too. (But to define ∂f/∂x, we study changes in f with respect to changes in only x.)

Note that for (∂f/∂x) ∆ to be a scalar, the vector ∂f/∂x should be a row vector, since ∆ is a column vector. The gradient of f at x is defined as the transpose of this derivative, that is, ∇f(x) = (∂f/∂x)^T. So one way to compute the derivative is to expand out f(x + ∆) and read the derivative off from the resulting expression. We call this method computation via first principle.

We now write down some formulas that are helpful for computing derivatives in various settings where a solution via first principle might be hard to compute. We will also distinguish between the derivative, gradient, Jacobian and Hessian in our notation.

1. Let f : R^d → R denote a scalar function, let x ∈ R^d denote a vector, and let A ∈ R^{d×d} denote a matrix. We have

    ∂f/∂x = [∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_d] ∈ R^{1×d},        (5)

and

    ∇_x f = (∂f/∂x)^T = [∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_d]^T ∈ R^{d×1}.        (6)

2. Let y : R^{m×n} → R be a scalar function defined on the space of m × n matrices. Then its derivative is an n × m matrix and is given by

    ∂y/∂B ∈ R^{n×m}   such that   [∂y/∂B]_{ij} = ∂y/∂B_{ji}.        (7)

An argument via first principles follows:

    y(B + ∆) = y(B) + trace((∂y/∂B) ∆) + o(‖∆‖).        (8)

3. For z : R^d → R^k a vector-valued function, its derivative ∂z/∂x is an operator such that it can help find the change in function value at x, up to first order, when we add a little perturbation ∆ to x:

    z(x + ∆) = z(x) + (∂z/∂x) ∆ + o(‖∆‖).        (9)

A formula for the same can be derived as

    J(z) = ∂z/∂x ∈ R^{k×d},   whose i-th row is ∂z_i/∂x = [∂z_i/∂x_1, ∂z_i/∂x_2, …, ∂z_i/∂x_d],        (10)

that is,

    [J(z)]_{ij} = [∂z/∂x]_{ij} = ∂z_i/∂x_j.        (11)

4. However, the Hessian of f is defined as

    H(f) = ∇²f(x) = J(∇f)^T ∈ R^{d×d},   with entries [H(f)]_{ij} = ∂²f/(∂x_i ∂x_j).        (12)

A first-principle definition is given as

    ∇f(x + ∆) ≈ ∇f(x) + ∇²f(x) ∆,        (13)

or equivalently, ∇f(x + ∆) = ∇f(x) + ∇²f(x) ∆ + o(‖∆‖). For sufficiently smooth functions (when the mixed derivatives are equal), the Hessian is a symmetric matrix, and in such cases (which cover a lot of cases in daily use) the convention does not matter.

5. The following linear algebra formulas are also helpful:

    (Ax)_i = Σ_{j=1}^d A_{ij} x_j,        (14)

and

    (A^T x)_i = Σ_{j=1}^d (A^T)_{ij} x_j = Σ_{j=1}^d A_{ji} x_j.        (15)

A short numerical illustration of these conventions follows.
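Before turning to the individual parts, the sketch below (assuming numpy; the function z(x) = Ax is just an illustrative example) checks the conventions numerically: a finite-difference Jacobian has entries [J(z)]_{ij} = ∂z_i/∂x_j as in (11), and it reproduces the first-order expansion (9).

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4
A = rng.standard_normal((k, d))
x = rng.standard_normal(d)
h = 1e-6

z = lambda v: A @ v   # a vector-valued function z(x) = Ax, from R^d to R^k

# Numerical Jacobian: column j holds dz/dx_j, so [J]_ij = dz_i/dx_j as in equation (11).
e = np.eye(d)
J = np.stack([(z(x + h * e[j]) - z(x - h * e[j])) / (2 * h) for j in range(d)], axis=1)
assert J.shape == (k, d) and np.allclose(J, A, atol=1e-6)

# First-order expansion (9): z(x + delta) - z(x) is approximated by J @ delta.
delta = 1e-3 * rng.standard_normal(d)
assert np.allclose(z(x + delta) - z(x), J @ delta, atol=1e-8)
```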

(a) ∂(w^T x)/∂x and ∇_x(w^T x)

Solution: We discuss two ways to solve the problem.

Using first principle: We use f(x) = w^T x. Then we have

    f(x + ∆) = w^T x + w^T ∆ = f(x) + w^T ∆.

Comparing with equation (4), we conclude that

    ∂(w^T x)/∂x = w^T   and   ∇_x(w^T x) = (∂(w^T x)/∂x)^T = w.

Using the formula (5): The idea is to use f(x) = w^T x and apply equation (5). Note that w^T x = Σ_j w_j x_j. Hence, we have

    ∂f/∂x_i = ∂(Σ_j w_j x_j)/∂x_i = w_i.

Thus, we find that

    ∂(w^T x)/∂x = [∂(Σ_j w_j x_j)/∂x_1, ∂(Σ_j w_j x_j)/∂x_2, …, ∂(Σ_j w_j x_j)/∂x_d] = [w_1, w_2, …, w_d] = w^T,

and ∇_x(w^T x) = (∂(w^T x)/∂x)^T = w.
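As a quick numerical check of this identity (a minimal sketch assuming numpy, not part of the original solution), a central-difference approximation of the derivative of f(x) = w^T x recovers w in every coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w = rng.standard_normal(d)
x = rng.standard_normal(d)
h = 1e-6

f = lambda v: w @ v  # f(x) = w^T x

# Central-difference estimate of each partial derivative df/dx_i.
grad = np.array([(f(x + h * np.eye(d)[i]) - f(x - h * np.eye(d)[i])) / (2 * h) for i in range(d)])
assert np.allclose(grad, w, atol=1e-6)   # matches the gradient w from part (a)
```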

(b) ∂(w^T Ax)/∂x and ∇_x(w^T Ax)

Solution: We discuss three ways to solve the problem.

Using part (a): Note that we can solve this question simply by using part (a). We substitute u = A^T w to obtain f(x) = u^T x. Now from part (a), we conclude that

    ∂(w^T Ax)/∂x = u^T = w^T A   and   ∇_x(w^T Ax) = (∂(w^T Ax)/∂x)^T = A^T w.

Using the first principle: Taking f(x) = w^T Ax and expanding, we have

    f(x + ∆) = w^T A(x + ∆) = w^T Ax + w^T A∆ = f(x) + w^T A∆.

Comparing with equation (4), we conclude that

    ∂(w^T Ax)/∂x = w^T A   and   ∇_x(w^T Ax) = (∂(w^T Ax)/∂x)^T = A^T w.

Using the formula (5): The idea is to use f(x) = w^T Ax and apply equation (5). Using the fact that w^T Ax = Σ_{i=1}^d Σ_{j=1}^d w_i A_{ij} x_j, we find that

    ∂f/∂x_j = ∂(Σ_{i=1}^d Σ_{j=1}^d w_i A_{ij} x_j)/∂x_j = ∂(Σ_{j=1}^d x_j (Σ_{i=1}^d A_{ij} w_i))/∂x_j = Σ_{i=1}^d A_{ij} w_i = Σ_{i=1}^d (A^T)_{ji} w_i = (A^T w)_j,

where in the last step we have used equation (15). Consequently, we have

    ∂(w^T Ax)/∂x = [(A^T w)_1, (A^T w)_2, …, (A^T w)_d] = (A^T w)^T = w^T A,

and

    ∇_x(w^T Ax) = (∂(w^T Ax)/∂x)^T = A^T w.
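For a concrete check (a sketch assuming numpy, not part of the original solution), note that f(x) = w^T Ax is linear in x, so the expansion above has no higher-order remainder: the increment f(x + ∆) − f(x) equals w^T A∆ exactly, and a finite-difference gradient agrees with A^T w.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.standard_normal((d, d))
w = rng.standard_normal(d)
x = rng.standard_normal(d)
delta = 1e-3 * rng.standard_normal(d)

f = lambda v: w @ A @ v  # f(x) = w^T A x, linear in x

# Increment equals the first-order term exactly: f(x + delta) - f(x) = (w^T A) delta.
assert np.isclose(f(x + delta) - f(x), (w @ A) @ delta)

# Finite-difference gradient agrees with the closed form A^T w.
h = 1e-6
grad = np.array([(f(x + h * np.eye(d)[i]) - f(x - h * np.eye(d)[i])) / (2 * h) for i in range(d)])
assert np.allclose(grad, A.T @ w, atol=1e-6)
```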

(c) ∂(w^T Ax)/∂w and ∇_w(w^T Ax)

Solution: We discuss three ways to solve the problem.

Using parts (a) and (b): Note that we can solve this question simply by using parts (a) and (b). We have w^T Ax = x^T A^T w, since for a scalar α we have α = α^T. Reversing the roles of x and w in part (b), we obtain

    ∂(w^T Ax)/∂w = ∂(x^T A^T w)/∂w = x^T A^T   and   ∇_w(w^T Ax) = (∂(w^T Ax)/∂w)^T = Ax.

Using the first principle: We now consider f as a function of w. Taking f(w) = w^T Ax and expanding, we have

    f(w + ∆) = (w + ∆)^T Ax = w^T Ax + ∆^T Ax = w^T Ax + x^T A^T ∆ = f(w) + x^T A^T ∆.

Comparing with equation (4), we conclude that

    ∂(w^T Ax)/∂w = x^T A^T   and   ∇_w(w^T Ax) = (∂(w^T Ax)/∂w)^T = Ax.

Using the formula (5): Using a similar idea as in the previous part, we have

    ∂f/∂w_i = ∂(Σ_{i=1}^d Σ_{j=1}^d w_i A_{ij} x_j)/∂w_i = ∂(Σ_{i=1}^d w_i (Σ_{j=1}^d A_{ij} x_j))/∂w_i = Σ_{j=1}^d A_{ij} x_j = (Ax)_i,

where in the last step we have used equation (14). Consequently, we have

    ∂(w^T Ax)/∂w = [(Ax)_1, (Ax)_2, …, (Ax)_d] = (Ax)^T = x^T A^T,

and

    ∇_w(w^T Ax) = (∂(w^T Ax)/∂w)^T = (x^T A^T)^T = Ax.
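The scalar-transpose trick used above is easy to confirm numerically (a minimal sketch assuming numpy, not part of the original solution): w^T Ax and x^T A^T w evaluate to the same number, and a finite-difference gradient with respect to w recovers Ax.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d))
w = rng.standard_normal(d)
x = rng.standard_normal(d)

# The scalar identity w^T A x = x^T A^T w.
assert np.isclose(w @ A @ x, x @ A.T @ w)

# Finite-difference gradient with respect to w recovers Ax.
f = lambda v: v @ A @ x  # f(w) = w^T A x, viewed as a function of w
h = 1e-6
grad_w = np.array([(f(w + h * np.eye(d)[i]) - f(w - h * np.eye(d)[i])) / (2 * h) for i in range(d)])
assert np.allclose(grad_w, A @ x, atol=1e-6)
```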

(d) ∂(w^T Ax)/∂A and ∇_A(w^T Ax)

Solution: We discuss two approaches to solve this problem.

Using the first principle (8): Treating y = w^T Ax as a function of A and expanding with respect to a change in A, we have

    y(A + ∆) = w^T (A + ∆) x = w^T Ax + w^T ∆ x.

Note that for two matrices M ∈ R^{m×n} and N ∈ R^{n×m}, we have trace(MN) = trace(NM). Since w^T ∆ x is a scalar, we can write w^T ∆ x = trace(w^T ∆ x), and using the trace trick we obtain

    w^T ∆ x = trace(w^T ∆ x) = trace(x w^T ∆).

Thus, we have

    y(A + ∆) = w^T (A + ∆) x = w^T Ax + w^T ∆ x = y(A) + trace(x w^T ∆),

which on comparison with equation (8) yields

    ∂(w^T Ax)/∂A = x w^T   and   ∇_A(w^T Ax) = (∂(w^T Ax)/∂A)^T = w x^T.

Using the formula (7): We use y = w^T Ax and apply the formula (7). We have w^T Ax = Σ_{i=1}^d Σ_{j=1}^d w_i A_{ij} x_j and hence

    [∂(w^T Ax)/∂A]_{ij} = ∂(w^T Ax)/∂A_{ji} = w_j x_i = (x w^T)_{ij}.

Consequently, we have

    ∂(w^T Ax)/∂A = x w^T,

and thereby ∇_A(w^T Ax) = w x^T.
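Because the variable here is a matrix, one way to check this result numerically (a minimal sketch assuming numpy, not part of the original solution) is to perturb one entry A_{ij} at a time: the resulting difference quotient should equal (w x^T)_{ij} = w_i x_j.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.standard_normal((d, d))
w = rng.standard_normal(d)
x = rng.standard_normal(d)
h = 1e-6

y = lambda M: w @ M @ x  # y(A) = w^T A x

# Perturb each entry A_ij separately; dy/dA_ij should equal w_i * x_j.
G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = h
        G[i, j] = (y(A + E) - y(A - E)) / (2 * h)

assert np.allclose(G, np.outer(w, x), atol=1e-6)   # gradient with respect to A is w x^T
```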

(e) ∂(x^T Ax)/∂x and ∇_x(x^T Ax)

Solution: We provide three ways to solve this problem.

Using the first principle: Taking f(x) = x^T Ax and expanding, we have

    f(x + ∆) = (x + ∆)^T A (x + ∆) = x^T Ax + ∆^T Ax + x^T A∆ + ∆^T A∆ = f(x) + (x^T A^T + x^T A)∆ + O(‖∆‖²),

which yields

    ∂(x^T Ax)/∂x = x^T (A^T + A)   and   ∇_x(x^T Ax) = (∂(x^T Ax)/∂x)^T = (A + A^T) x.

Using the product rule, and parts (b) and (c): We have

    ∂(x^T Ax)/∂x = ∂(w^T Ax)/∂x |_{w=x} + ∂(w^T Ax)/∂w |_{w=x} = w^T A |_{w=x} + x^T A^T |_{w=x} = x^T (A + A^T),

and thereby ∇_x(x^T Ax) = (∂(x^T Ax)/∂x)^T = (A + A^T) x.

Using the formula (5): We have x^T Ax = Σ_{i=1}^d Σ_{j=1}^d x_i A_{ij} x_j. For any given index ℓ, we have

    x^T Ax = A_{ℓℓ} x_ℓ² + x_ℓ Σ_{j≠ℓ} (A_{jℓ} + A_{ℓj}) x_j + Σ_{i≠ℓ} Σ_{j≠ℓ} x_i A_{ij} x_j.

Thus we have

    ∂(x^T Ax)/∂x_ℓ = 2 A_{ℓℓ} x_ℓ + Σ_{j≠ℓ} (A_{jℓ} + A_{ℓj}) x_j = Σ_{j=1}^d (A_{jℓ} + A_{ℓj}) x_j = ((A^T + A) x)_ℓ.

And consequently

    ∂(x^T Ax)/∂x = [∂(x^T Ax)/∂x_1, ∂(x^T Ax)/∂x_2, …, ∂(x^T Ax)/∂x_d]
                 = [((A^T + A)x)_1, ((A^T + A)x)_2, …, ((A^T + A)x)_d]
                 = ((A^T + A)x)^T = x^T (A + A^T),

and hence ∇_x(x^T Ax) = (∂(x^T Ax)/∂x)^T = (A + A^T) x.
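Numerically (a minimal sketch assuming numpy, not part of the original solution), the expansion above can be seen directly: the first-order part of f(x + ∆) − f(x) is x^T(A + A^T)∆, and the leftover term ∆^T A∆ is quadratic in ‖∆‖.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)
delta = 1e-4 * rng.standard_normal(d)

f = lambda v: v @ A @ v  # f(x) = x^T A x

increment = f(x + delta) - f(x)
first_order = (A + A.T) @ x @ delta   # <grad f(x), delta> with grad f(x) = (A + A^T) x
remainder = delta @ A @ delta         # the exact quadratic remainder

assert np.isclose(increment, first_order + remainder)   # the expansion is exact for a quadratic
assert abs(remainder) <= np.linalg.norm(A, 2) * np.linalg.norm(delta) ** 2   # O(||delta||^2)
```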

(f) ∇²_x(x^T Ax)

Solution: We discuss two ways to solve this problem.

Using the first principle: We expand z(x) = ∇f(x) = (A + A^T)x and find that

    z(x + ∆) = (A + A^T)x + (A + A^T)∆.

Comparing with equation (13), we obtain that ∇²f(x) = A + A^T.

Using the formula (12): A straightforward computation yields

    ∂²f/(∂x_i ∂x_j) = A_{ij} + A_{ji},

and hence

    ∇²f(x) = [∂²f/(∂x_i ∂x_j)] = [A_{ij} + A_{ji}] = A + A^T.
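Finally, a second-difference check (a minimal sketch assuming numpy, not part of the original solution): estimating each Hessian entry of f(x) = x^T Ax numerically recovers A + A^T.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)
h = 1e-4

f = lambda v: v @ A @ v  # f(x) = x^T A x
e = np.eye(d)

# Mixed central differences for H_ij = d^2 f / (dx_i dx_j).
H = np.empty((d, d))
for i in range(d):
    for j in range(d):
        H[i, j] = (f(x + h*e[i] + h*e[j]) - f(x + h*e[i] - h*e[j])
                   - f(x - h*e[i] + h*e[j]) + f(x - h*e[i] - h*e[j])) / (4 * h**2)

assert np.allclose(H, A + A.T, atol=1e-5)   # Hessian of the quadratic form is A + A^T
```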
