
CIS 520, Machine Learning, Fall 2018: Assignment 1
Ashish Mehta
September 17, 2018
Collaborators: Srinath Rajagopalan

1  Conditional independence in probability models

1. We can write

       p(x_i) = Σ_{j=1}^{k} p(x_i | z_i = j) · p(z_i = j) = Σ_{j=1}^{k} f_j(x_i) · π_j
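As a quick numerical sanity check of this marginal, one can evaluate the mixture directly; a minimal sketch, assuming a 1-D Gaussian mixture with made-up means, unit variance, and mixing weights (none of these values come from the assignment):

    import numpy as np
    from scipy.stats import norm

    # hypothetical 1-D Gaussian mixture: component densities f_j = N(mu_j, 1), weights pi_j
    pi = np.array([0.5, 0.3, 0.2])      # mixing weights pi_j (sum to 1)
    mu = np.array([-2.0, 0.0, 3.0])     # illustrative component means
    x_i = 0.7                           # a single observation

    # p(x_i) = sum_j f_j(x_i) * pi_j
    p_xi = np.sum(norm.pdf(x_i, loc=mu, scale=1.0) * pi)
    print(p_xi)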

2. The formula for p(x_1, ..., x_n) is p(x_1) · p(x_2) ··· p(x_n), given that x_1, x_2, ..., x_n are independent of each other:

       p(x_1, ..., x_n) = p(x_1) · p(x_2) ··· p(x_n)                     (since all x_i are independent)
                        = Π_{i=1}^{n} p(x_i)
                        = Π_{i=1}^{n} Σ_{j=1}^{k} p(x_i | z_i = j) · p(z_i = j)
                        = Π_{i=1}^{n} Σ_{j=1}^{k} f_j(x_i) · π_j
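This product-of-mixtures expression is exactly the likelihood one would evaluate for a sample, usually in log space for numerical stability; a sketch reusing the same made-up mixture as above (the observations are also invented):

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.5, 0.3, 0.2])          # hypothetical mixing weights pi_j
    mu = np.array([-2.0, 0.0, 3.0])         # hypothetical component means
    x = np.array([0.7, -1.4, 2.9, 0.1])     # hypothetical observations x_1 .. x_n

    # p(x_1,...,x_n) = prod_i sum_j f_j(x_i) * pi_j, accumulated in log space
    per_point = norm.pdf(x[:, None], loc=mu, scale=1.0) @ pi   # shape (n,): each p(x_i)
    log_joint = np.sum(np.log(per_point))
    print(log_joint, np.exp(log_joint))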

3. The formula for p(z_u = v | x_1, ..., x_n) follows from Bayes' theorem:

       p(z_u = v | x_1, ..., x_n) = p(x_1, ..., x_n | z_u = v) · p(z_u = v) / p(x_1, ..., x_n)

   Since all x_i are independent and all z_i are independent, knowing which component generated one point does not change the probability of observing the other points. Thus we have conditional independence,

       p(x_1, ..., x_n | z_u = v) = Π_{i=1}^{n} p(x_i | z_u = v),   with   p(x_i | z_u = v) = p(x_i) for i ≠ u.

   Therefore,

       p(z_u = v | x_1, ..., x_n) = [Π_{i=1}^{n} p(x_i | z_u = v)] · π_v / p(x_1, ..., x_n)
                                  = [Π_{i≠u} p(x_i)] · f_v(x_u) · π_v / ( [Π_{i≠u} p(x_i)] · p(x_u) )
                                  = f_v(x_u) · π_v / Σ_{j=1}^{k} f_j(x_u) · π_j
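The final expression is the usual "responsibility" computation for mixture models; a minimal sketch under the same made-up mixture, returning the posterior over all k components for one point:

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.5, 0.3, 0.2])      # hypothetical mixing weights pi_j
    mu = np.array([-2.0, 0.0, 3.0])     # hypothetical component means
    x_u = 0.7                           # the point whose component label z_u we query

    # p(z_u = v | x_1,...,x_n) = f_v(x_u) * pi_v / sum_j f_j(x_u) * pi_j
    numer = norm.pdf(x_u, loc=mu, scale=1.0) * pi
    posterior = numer / numer.sum()
    print(posterior, posterior.sum())   # the posterior sums to 1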


2  Non-Normal Norms

1. For the given vectors, the point closest to x_1 under each norm is (a quick numerical check is sketched after this list):
   a) L0: x_4, with distance = 2.0
   b) L1: x_3, with distance = 1.2
   c) L2: x_2, with distance = 0.7874
   d) L∞: x_2, with distance = 0.6
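Since the assignment's vectors x_1, ..., x_4 are not reproduced here, the sketch below uses placeholder coordinates purely to show how such nearest-neighbor distances can be checked (with the L0 "norm" counted as the number of differing coordinates):

    import numpy as np

    # placeholder vectors; substitute the assignment's actual x1 .. x4
    x1 = np.array([1.0, 1.0])
    others = {"x2": np.array([1.4, 1.6]), "x3": np.array([0.4, 1.6]), "x4": np.array([3.0, 3.0])}

    def dist(a, b, norm):
        d = a - b
        if norm == "L0":
            return float(np.count_nonzero(d))    # number of coordinates that differ
        if norm == "L1":
            return float(np.sum(np.abs(d)))
        if norm == "L2":
            return float(np.sqrt(np.sum(d ** 2)))
        return float(np.max(np.abs(d)))          # Linf

    for norm in ["L0", "L1", "L2", "Linf"]:
        nearest = min(others, key=lambda k: dist(x1, others[k], norm))
        print(norm, nearest, dist(x1, others[nearest], norm))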

2. Draw the 1-Nearest-Neighbor decision boundaries with the given norms and lightly shade the o region:

   [Figure: 1-NN decision boundaries and shaded o regions under a) L0, b) L1, c) L2, d) L∞.]

3  Decision trees

1. Concrete sample training data.

(a) The sample entropy H(Y) is −p(+) log₂ p(+) − p(−) log₂ p(−):

       H(Y) = −(16/30) · log₂(16/30) − (14/30) · log₂(14/30) = 0.9968
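This entropy value is easy to reproduce; a minimal helper (in bits), assuming only the class counts 16 and 14 from the table:

    import numpy as np

    def entropy(counts):
        # Shannon entropy (in bits) of a label distribution given raw class counts
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return -np.sum(p * np.log2(p))

    print(entropy([16, 14]))   # ~0.9968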


(b) The information gains are IG(X1) = 0.0114 and IG(X2) = 0.0487.

       IG(X1) = H(Y) − [ p(X1 = T) · H(Y | X1 = T) + p(X1 = F) · H(Y | X1 = F) ]
              = H(Y) − [ p(X1 = T) · ( −p(+ | X1 = T) log₂ p(+ | X1 = T) − p(− | X1 = T) log₂ p(− | X1 = T) )
                         + p(X1 = F) · ( −p(+ | X1 = F) log₂ p(+ | X1 = F) − p(− | X1 = F) log₂ p(− | X1 = F) ) ]
              = 0.9968 − [ (13/30) · ( −(7/13) log₂(7/13) − (6/13) log₂(6/13) )
                           + (17/30) · ( −(10/17) log₂(10/17) − (7/17) log₂(7/17) ) ]
              = 0.9968 − 0.9854
              = 0.0114

       IG(X2) = H(Y) − [ p(X2 = T) · H(Y | X2 = T) + p(X2 = F) · H(Y | X2 = F) ]
              = 0.9968 − [ (11/30) · ( −(7/11) log₂(7/11) − (4/11) log₂(4/11) )
                           + (19/30) · ( −(12/19) log₂(12/19) − (7/19) log₂(7/19) ) ]
              = 0.9968 − 0.9481
              = 0.0487
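The same helper extends to a quick information-gain check. The per-branch (+, −) counts below are reconstructed from the fractions above (swapping the two labels inside a branch leaves the entropies unchanged), so treat them as illustrative:

    import numpy as np

    def entropy(counts):
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return -np.sum(p * np.log2(p))

    def info_gain(parent_counts, branch_counts):
        # IG = H(parent) minus the count-weighted average of the branch entropies
        n = float(sum(sum(c) for c in branch_counts))
        cond = sum((sum(c) / n) * entropy(c) for c in branch_counts)
        return entropy(parent_counts) - cond

    print(info_gain([16, 14], [[6, 7], [10, 7]]))   # IG(X1) ~ 0.0114
    print(info_gain([16, 14], [[4, 7], [12, 7]]))   # IG(X2) ~ 0.0487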

(c) Since IG(X2) > IG(X1), we first split the data on X2. For the second split we consider X1, as it is the only remaining feature. To show that splitting on X1 after X2 is not redundant, we compute IG(X1 | X2 = T) and IG(X1 | X2 = F).

       IG(X1 | X2 = T) = H(Y | X2 = T) − [ p(X1 = T | X2 = T) · H(Y | X1 = T, X2 = T)
                                           + p(X1 = F | X2 = T) · H(Y | X1 = F, X2 = T) ]
                       = [ −(7/11) log₂(7/11) − (4/11) log₂(4/11) ]
                         − [ (6/11) · ( −(1/6) log₂(1/6) − (5/6) log₂(5/6) )
                             + (5/11) · ( −(3/5) log₂(3/5) − (2/5) log₂(2/5) ) ]
                       = 0.9457 − 0.7959
                       = 0.1498

       IG(X1 | X2 = F) = H(Y | X2 = F) − [ p(X1 = T | X2 = F) · H(Y | X1 = T, X2 = F)
                                           + p(X1 = F | X2 = F) · H(Y | X1 = F, X2 = F) ]
                       = [ −(12/19) log₂(12/19) − (7/19) log₂(7/19) ]
                         − [ (7/19) · ( −(5/7) log₂(5/7) − (2/7) log₂(2/7) )
                             + (12/19) · ( −(7/12) log₂(7/12) − (5/12) log₂(5/12) ) ]
                       = 0.9495 − 0.9369
                       = 0.0126

   Since these information gains are non-zero, the label is not yet fully determined after splitting on X2 alone, so the second split on X1 is not redundant. Each leaf predicts the majority label of the training points that reach it. The decision tree that would be learned is shown in Figure 1.


[Figure 1: The decision tree that would be learned.]

2. Information gain and KL-divergence.

(a) If variables X and Y are independent, is IG(x, y) = 0? If yes, prove it. If no, give a counter-example.

       IG(x, y) = −Σ_x Σ_y p(x, y) · log( p(x) · p(y) / p(x, y) )

    Since x and y are independent, p(x, y) = p(x) · p(y). Therefore,

       IG(x, y) = −Σ_x Σ_y p(x, y) · log( p(x) · p(y) / (p(x) · p(y)) )
                = −Σ_x Σ_y p(x, y) · log(1)
                = 0

(b) Proof that IG(x, y) = H[x] − H[x | y] = H[y] − H[y | x], starting from the definition in terms of KL-divergence:

       IG(x, y) = KL( p(x, y) || p(x) p(y) )
                = −Σ_x Σ_y p(x, y) · log( p(x) · p(y) / p(x, y) )
                = Σ_x Σ_y p(x, y) · log( p(x, y) / (p(x) · p(y)) )
                = Σ_x Σ_y p(x, y) · log( p(x, y) / p(y) ) − Σ_x Σ_y p(x, y) · log p(x)
                = Σ_y p(y) Σ_x p(x | y) · log p(x | y) − Σ_x log p(x) Σ_y p(x, y)
                = −( −Σ_y p(y) Σ_x p(x | y) · log p(x | y) ) − Σ_x p(x) · log p(x)
                = −H[x | y] + H[x]
                = H[x] − H[x | y]

    By symmetry (splitting off log p(y) instead of log p(x)), the same argument gives IG(x, y) = H[y] − H[y | x].
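Both identities are easy to confirm numerically on a small joint distribution; a sketch with a made-up 2x2 table p(x, y):

    import numpy as np

    # hypothetical joint distribution p(x, y) over two binary variables
    pxy = np.array([[0.30, 0.20],
                    [0.10, 0.40]])
    px = pxy.sum(axis=1)      # marginal p(x)
    py = pxy.sum(axis=0)      # marginal p(y)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # IG (mutual information) straight from the KL-divergence definition
    ig = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

    # H[x|y] = H[x,y] - H[y] and H[y|x] = H[x,y] - H[x]
    Hxy = H(pxy.ravel())
    print(ig, H(px) - (Hxy - H(py)), H(py) - (Hxy - H(px)))   # all three agree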

4  High dimensional hi-jinx

1. Intra-class distance.

       E[(X − X′)²] = E[X²] − 2 · E[X · X′] + E[X′²]
                    = E[X²] − 2 · E[X] · E[X′] + E[X′²]        (since X and X′ are independent)
                    = (μ_1² + σ²) − 2μ_1² + (μ_1² + σ²)
                    = 2σ²

2. Inter-class distance.

       E[(X − X′)²] = E[X²] − 2 · E[X · X′] + E[X′²]
                    = E[X²] − 2 · E[X] · E[X′] + E[X′²]        (since X and X′ are independent)
                    = (μ_1² + σ²) − 2μ_1μ_2 + (μ_2² + σ²)
                    = (μ_1 − μ_2)² + 2σ²
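Both one-dimensional results can be verified by Monte Carlo; the parameter values and sample size below are arbitrary illustration choices:

    import numpy as np

    rng = np.random.default_rng(0)
    mu1, mu2, sigma, n = 1.0, 3.0, 0.5, 1_000_000   # illustrative values

    x  = rng.normal(mu1, sigma, n)   # X  ~ N(mu1, sigma^2)
    xp = rng.normal(mu1, sigma, n)   # X' ~ N(mu1, sigma^2), same class
    xq = rng.normal(mu2, sigma, n)   # X' ~ N(mu2, sigma^2), other class

    print(np.mean((x - xp) ** 2), 2 * sigma ** 2)                      # intra-class
    print(np.mean((x - xq) ** 2), (mu1 - mu2) ** 2 + 2 * sigma ** 2)   # inter-class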

3. Intra-class distance, m dimensions.

       E[ Σ_{j=1}^{m} (X_j − X′_j)² ] = Σ_{j=1}^{m} ( E[X_j²] − 2 · E[X_j] · E[X′_j] + E[X′_j²] )
                                      = Σ_{j=1}^{m} ( μ_{1j}² + σ² − 2μ_{1j}² + μ_{1j}² + σ² )
                                      = Σ_{j=1}^{m} 2σ²
                                      = 2mσ²

4. Inter-class distance, m dimensions.

       E[ Σ_{j=1}^{m} (X_j − X′_j)² ] = Σ_{j=1}^{m} ( E[X_j²] − 2 · E[X_j] · E[X′_j] + E[X′_j²] )
                                      = Σ_{j=1}^{m} ( μ_{1j}² + σ² − 2μ_{1j}μ_{2j} + μ_{2j}² + σ² )
                                      = Σ_{j=1}^{m} ( (μ_{1j} − μ_{2j})² + 2σ² )
                                      = (μ_{11} − μ_{21})² + 2mσ²,

   since all μ_{1j} and μ_{2j} other than μ_{11} and μ_{21} are zero.

5. The ratio of expected intra-class distance to expected inter-class distance is

       2mσ² / ( (μ_{11} − μ_{21})² + 2mσ² ).

   This ratio signifies how far apart intra-class points are compared to inter-class points when the two classes differ only in the first dimension and are identically distributed in every other dimension. Dividing the numerator and denominator by m gives

       2σ² / ( (μ_{11} − μ_{21})²/m + 2σ² ).

   As m increases towards ∞, this ratio approaches 1. A ratio of 1 means that as the number of such equidistant dimensions grows, the expected inter-class and intra-class distances become equal, which makes it harder for k-NN to classify between the classes. (A short numerical illustration follows.)
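A short numerical illustration of how this ratio behaves as m grows (the gap μ_{11} − μ_{21} and σ below are arbitrary):

    delta, sigma = 2.0, 1.0                    # illustrative gap mu_11 - mu_21 and noise level
    for m in [1, 10, 100, 1000, 100000]:
        intra = 2 * m * sigma ** 2
        inter = delta ** 2 + 2 * m * sigma ** 2
        print(m, intra / inter)                # approaches 1 as m grows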

5  Fitting distributions with KL divergence

KL divergence for Gaussians.

1. The KL divergence between the two univariate Gaussians p(x) = N(μ_1, σ²) and q(x) = N(μ_2, 1) can be written as E_p[f(x, μ_1, μ_2, σ)] + g(σ), where

       f = −(x − μ_1)²/(2σ²) + (x − μ_2)²/2   and   g = −log σ:

       KL(p(x) || q(x)) = E_p[ log( p(x) / q(x) ) ]
                        = E_p[ log( e^{−(x − μ_1)²/(2σ²)} / ( σ · e^{−(x − μ_2)²/2} ) ) ]
                        = E_p[ log e^{−(x − μ_1)²/(2σ²)} − log e^{−(x − μ_2)²/2} − log σ ]
                        = E_p[ −(x − μ_1)²/(2σ²) + (x − μ_2)²/2 − log σ ]
                        = E_p[ f(x, μ_1, μ_2, σ) ] + g(σ)

2. The value μ_1 = μ_2 minimizes KL(p(x) || q(x)). Taking the expectation of f under p (using E_p[(x − μ_1)²] = σ² and E_p[(x − μ_2)²] = (μ_1 − μ_2)² + σ²) gives

       KL(p(x) || q(x)) = (1/2) · [ (μ_1 − μ_2)² + σ² − 1 ] − log σ.

   Setting the derivative with respect to μ_1 to zero:

       0 = ∂ KL(p(x) || q(x)) / ∂μ_1
         = ∂( (1/2) · [ (μ_1 − μ_2)² + σ² − 1 ] − log σ ) / ∂μ_1
         = μ_1 − μ_2

   so μ_1 = μ_2. The minimum value of KL(p(x) || q(x)) is therefore

       (1/2) · [ (μ_2 − μ_2)² + σ² − 1 ] − log σ = (1/2) · [ σ² − 1 ] − log σ.
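The closed form in part 2 can be checked against a Monte Carlo estimate of E_p[log p(x)/q(x)]; the parameter values below are arbitrary:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu1, mu2, sigma = 0.8, -0.3, 1.7   # illustrative parameters: p = N(mu1, sigma^2), q = N(mu2, 1)

    x = rng.normal(mu1, sigma, 1_000_000)                                 # samples from p
    mc = np.mean(norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu2, 1.0))   # E_p[log p/q]
    closed = 0.5 * ((mu1 - mu2) ** 2 + sigma ** 2 - 1) - np.log(sigma)
    print(mc, closed)   # the two values agree closely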
