Machine Learning for Graphs and Sequential Data: Endterm Exam Solutions


Data Analytics and Machine Learning Group
Department of Informatics
Technical University of Munich

Note:

• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Exam: IN2323 / Endterm
Examiner: Prof. Dr. Stephan Günnemann
Course: Machine Learning for Graphs and Sequential Data
Date: Wednesday 5th August, 2020
Time: 11:30 – 12:45

Working instructions

• This exam consists of 14 pages with a total of 10 problems. Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 43 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources:
  – all materials that you will use on your own (lecture slides, calculator etc.)
  – not allowed are any forms of collaboration between examinees and plagiarism
• You have to sign the code of conduct.
• Make sure that the QR codes are visible on every uploaded page. Otherwise, we cannot grade your exam.
• Only write on the provided sheets; submitting your own additional sheets is not possible.
• The last two pages can be used as scratch paper.
• All sheets (including scratch paper) have to be submitted to the upload queue. Missing pages will be considered empty.
• Only use a black or blue color (no red or green)!
• Write your answers only in the provided solution boxes or the scratch paper.
• For problems that say "Justify your answer" you only get points if you provide a valid explanation.
• For problems that say "Prove" you only get points if you provide a valid mathematical proof.
• If a problem does not say "Justify your answer" or "Prove", it is sufficient to only provide the correct answer.
• Exam duration: 75 minutes.


Problem 1

Normalizing Flows (4 credits)

We consider two transformations from R² to R²:
$$f_1(z) = \begin{pmatrix} z_1 \\ z_2^{1/3} \end{pmatrix}, \qquad f_2(z) = \begin{pmatrix} z_1\,(|z_2|+1) \\ z_2 \end{pmatrix}.$$
The respective inverse transformations are
$$f_1^{-1}(x) = \begin{pmatrix} x_1 \\ x_2^{3} \end{pmatrix}, \qquad f_2^{-1}(x) = \begin{pmatrix} \frac{x_1}{|x_2|+1} \\ x_2 \end{pmatrix}.$$
The respective Jacobians are
$$J_{f_1} = \begin{pmatrix} 1 & 0 \\ 0 & \tfrac{1}{3} z_2^{-2/3} \end{pmatrix}, \qquad
J_{f_1^{-1}} = \begin{pmatrix} 1 & 0 \\ 0 & 3 x_2^{2} \end{pmatrix},$$
$$J_{f_2} = \begin{pmatrix} |z_2|+1 & \operatorname{sign}(z_2)\, z_1 \\ 0 & 1 \end{pmatrix}, \qquad
J_{f_2^{-1}} = \begin{pmatrix} \frac{1}{|x_2|+1} & \frac{-\operatorname{sign}(x_2)\, x_1}{(|x_2|+1)^{2}} \\ 0 & 1 \end{pmatrix}.$$

We assume a Gaussian base distribution p1(z) = N(0, I). We observed one point x = (1, 2)ᵀ.

We propose to stack the transformations f1, f2 to transform the base distribution p1 into the distribution p2 with normalizing flows. Compute the likelihood of x under the transformed distribution p2 if the order of transformations is f1 followed by f2.

Hint: You might use the density of the univariate Gaussian p = N(0, 1) at the following points: p(1/2) = 0.3521, p(1/3) = 0.3774, p(1/9) = 0.3965, p(5) = 1.4867e−06, p(8) = 5.0523e−15, p(10) = 7.6946e−23.

Version 1: We consider f1 followed by f2, i.e. x = f2(f1(z)), and compute:
$$f_2^{-1}(x) = \begin{pmatrix} 1/3 \\ 2 \end{pmatrix}, \qquad f_1^{-1}\big(f_2^{-1}(x)\big) = \begin{pmatrix} 1/3 \\ 8 \end{pmatrix}.$$
By using the determinants of the Jacobians, we apply the change of variables formula:
$$p_2(x) = p_1\!\left(\begin{pmatrix} 1/3 \\ 8 \end{pmatrix}\right) \times (3 \times 2^{2}) \times \frac{1}{|2|+1}
= p_1\!\left(\begin{pmatrix} 1/3 \\ 8 \end{pmatrix}\right) \times 4 = 7.6266\mathrm{e}{-15}.$$

Version 2: We consider f2 followed by f1, i.e. x = f1(f2(z)), and compute:
$$f_1^{-1}(x) = \begin{pmatrix} 1 \\ 8 \end{pmatrix}, \qquad f_2^{-1}\big(f_1^{-1}(x)\big) = \begin{pmatrix} 1/9 \\ 8 \end{pmatrix}.$$
By using the determinants of the Jacobians, we apply the change of variables formula:
$$p_2(x) = p_1\!\left(\begin{pmatrix} 1/9 \\ 8 \end{pmatrix}\right) \times (3 \times 2^{2}) \times \frac{1}{|8|+1}
= p_1\!\left(\begin{pmatrix} 1/9 \\ 8 \end{pmatrix}\right) \times \frac{4}{3} = 2.6709\mathrm{e}{-15}.$$
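As a quick numerical check of the computation above, a minimal sketch in Python (NumPy/SciPy); the transformations and the observed point are taken from the problem, while the function names are ours:

```python
import numpy as np
from scipy.stats import norm

# Inverse transformations from the problem statement.
def f1_inv(x):                 # f1(z) = (z1, z2^(1/3))      =>  f1^{-1}(x) = (x1, x2^3)
    return np.array([x[0], x[1] ** 3])

def f2_inv(x):                 # f2(z) = (z1 (|z2| + 1), z2) =>  f2^{-1}(x) = (x1 / (|x2| + 1), x2)
    return np.array([x[0] / (np.abs(x[1]) + 1.0), x[1]])

def logabsdet_Jf1_inv(x):      # |det J_{f1^{-1}}(x)| = 3 x2^2
    return np.log(3.0 * x[1] ** 2)

def logabsdet_Jf2_inv(x):      # |det J_{f2^{-1}}(x)| = 1 / (|x2| + 1)
    return -np.log(np.abs(x[1]) + 1.0)

def log_p2(x):
    """log p2(x) when a base sample z is pushed through f1 first, then f2."""
    y = f2_inv(x)                        # undo the last transformation first
    z = f1_inv(y)
    log_base = norm.logpdf(z).sum()      # standard normal base density N(0, I)
    return log_base + logabsdet_Jf2_inv(x) + logabsdet_Jf1_inv(y)

x = np.array([1.0, 2.0])
print(np.exp(log_p2(x)))                 # ~7.63e-15, matching the hand computation above
```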


Problem 2

Variational Inference (5 credits)

We are performing variational inference in some latent variable model pθ(x, z) using the following family of variational distributions Q1 = {N(z | φ, 1) : φ ∈ R}.

a) Assume that the variational distribution q ∈ Q1 is fixed, and we are trying to maximize the ELBO w.r.t. θ using gradient ascent. Is it necessary to use the reparametrization trick in this case? If yes, explain how to do it for our family of distributions Q1; if not, provide a justification.



No, we don't need to use the reparametrization trick when optimizing only w.r.t. θ (i.e., when q is fixed). We can approximate the gradient of the ELBO w.r.t. θ simply using Monte Carlo, and the reparametrization trick is not necessary (see slide 75 of Lecture 2).
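A minimal sketch of this point: with q fixed, the gradient of the ELBO w.r.t. θ can be estimated by plain Monte Carlo samples from q, and no gradient has to flow through the sampling step. The toy joint model below is an illustrative assumption, not part of the exam:

```python
import torch

# Toy joint model p_theta(x, z): z ~ N(0, 1), x | z ~ N(theta * z, 1).
theta = torch.tensor(0.5, requires_grad=True)
phi = 1.2                                 # fixed variational mean, q(z) = N(phi, 1)
x_obs = torch.tensor(2.0)

def log_p_joint(x, z, theta):
    log_prior = -0.5 * z ** 2              # log N(z | 0, 1) up to a constant
    log_lik = -0.5 * (x - theta * z) ** 2  # log N(x | theta z, 1) up to a constant
    return log_prior + log_lik

# ELBO(theta, q) = E_q[log p_theta(x, z)] + H[q]; the entropy of q does not depend
# on theta, so a plain Monte Carlo average of log p_theta(x, z) with z ~ q suffices.
z = phi + torch.randn(10_000)              # samples from q; no gradient flows through z
elbo_term = log_p_joint(x_obs, z, theta).mean()
elbo_term.backward()
print(theta.grad)                          # unbiased estimate of d ELBO / d theta
```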

b) Consider another family of distributions Q2 = {N(z | 0, s²) : s ∈ (0, ∞)}. Which of the following statements is true? Justify your answer.

1. maxθ,q∈Q1 ELBO(θ, q) < maxθ,q∈Q2 ELBO(θ, q)
2. maxθ,q∈Q1 ELBO(θ, q) = maxθ,q∈Q2 ELBO(θ, q)
3. maxθ,q∈Q1 ELBO(θ, q) > maxθ,q∈Q2 ELBO(θ, q)
4. It's impossible to tell without additional information.

It's impossible to tell without additional information about pθ(x, z). For example, if the true posterior is N(z | µ, 1) for some µ ∈ R, then (3) will hold. As another example, if the true posterior is N(z | 0, σ²) for some σ ∈ (0, ∞), then (1) will hold. This means that it's impossible to tell which one is the case without additional information.


Problem 3

Robustness of Machine Learning Models (6 credits)

Suppose we have trained a binary classifier f : R^d → {0, 1} and want to certify its robustness via randomized smoothing. To this end, we define the smoothed classifier g_{σ²}(x) = E[I[f(x + ε) = 1]], where ε ∼ N(0, σ²I).

Fact: Φ⁻¹(g_{σ²}(x)) is 1/σ-Lipschitz w.r.t. x and the L2 norm, where Φ denotes the cumulative distribution function (CDF) of the standard normal distribution.

a) Using the above fact about the Lipschitz continuity of Φ⁻¹(g_{σ²}(x)), show that the largest certifiable L2 radius r around a sample x is identical to the result shown in the lecture. More precisely, show that r = σ Φ⁻¹(g_{σ²}(x)).

Hint: You may assume we can evaluate g_{σ²}(x) in closed form and you may use the following results: lim_{z→0} Φ⁻¹(z) = −∞, Φ⁻¹(0.5) = 0, lim_{z→1} Φ⁻¹(z) = ∞.

The prediction of the smoothed classifier can only change at a perturbed point x̃ once g_{σ²}(x̃) reaches 0.5, i.e. Φ⁻¹(g_{σ²}(x̃)) = Φ⁻¹(0.5) = 0. By the Lipschitz property,
$$\left\| \Phi^{-1}\big(g_{\sigma^2}(x)\big) - \Phi^{-1}(0.5) \right\|_2 \le \frac{1}{\sigma} \left\| x - \tilde{x} \right\|_2,$$
so a change of the prediction requires
$$\Phi^{-1}\big(g_{\sigma^2}(x)\big) \le \frac{\left\| x - \tilde{x} \right\|_2}{\sigma}
\quad\Longleftrightarrow\quad
\left\| x - \tilde{x} \right\|_2 \ge \sigma\, \Phi^{-1}\big(g_{\sigma^2}(x)\big).$$
Hence the largest certifiable radius is r = σ Φ⁻¹(g_{σ²}(x)).
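A small illustrative sketch of the resulting certificate, estimating g_{σ²}(x) by Monte Carlo for a toy linear classifier (in practice one would use a high-confidence lower bound on g rather than a point estimate):

```python
import numpy as np
from scipy.stats import norm

def certified_radius(f, x, sigma, n_samples=10_000, rng=None):
    """Monte Carlo estimate of the certified L2 radius r = sigma * Phi^{-1}(g_sigma(x)).

    f maps a batch of inputs to {0, 1} predictions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    g_hat = np.mean(f(x + noise) == 1)            # estimate of g_{sigma^2}(x)
    if g_hat <= 0.5:                              # smoothed prediction is not class 1
        return 0.0
    g_hat = min(g_hat, 1.0 - 1e-6)                # keep Phi^{-1} finite
    return sigma * norm.ppf(g_hat)                # r = sigma * Phi^{-1}(g)

f = lambda z: (z[..., 0] + z[..., 1] > 0).astype(int)   # toy classifier in 2D
print(certified_radius(f, x=np.array([2.0, 1.0]), sigma=0.5))
```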

b) A fellow student has a promising idea: by letting σ → ∞ we can make the Lipschitz constant of the smoothed classifier arbitrarily small, leading to arbitrarily large certifiable radii, i.e. a very robust model. Is this a good idea? Why or why not?


No, since σ → ∞ means that the signal-to-noise ratio of the smoothed examples approaches zero. Similar, also valid arguments:

• For σ → ∞ we sample from very large regions of the input space, for which a large fraction of samples belongs to other classes ⇒ we don't get certificates.
• σ → ∞ means we introduce so much noise that we lose accuracy / performance of the smoothed classifier.


Problem 4

Markov Property (3 credits)

We consider the following sequences of random variables U0, U1, ..., Ut.

a) Ut = (Xt, Zt), where Xt are observed variables and Zt are latent variables of a Hidden Markov Model. Does the sequence of variables Ut fulfill the Markov property, i.e. P(Ut | Ut−1) = P(Ut | Ut−1, ..., U0)? Justify your answer.

Yes. Since Xt and Zt are conditionally independent of Xt−2, Zt−2, ..., X0, Z0 given Zt−1, we can write:
$$P(X_t, Z_t \mid X_{t-1}, Z_{t-1}, \ldots, X_0, Z_0) = P(X_t, Z_t \mid X_{t-1}, Z_{t-1}).$$

b) We consider an AR(p) process Xt. Under what condition on p and k does the sequence of variables Ut = [Xt−1, ..., Xt−k] fulfill the Markov property, i.e. P(Ut | Ut−1) = P(Ut | Ut−1, ..., U0)? Justify your answer.

For an AR(p) process Xt we compute:
$$P(X_{t-1}, \ldots, X_{t-k} \mid X_{t-2}, \ldots, X_0) = P(X_{t-1} \mid X_{t-2}, \ldots, X_{t-p-1}).$$

This quantity is equal to P(Xt−1, ..., Xt−k | Xt−2, ..., Xt−k−1) = P(Xt−1 | Xt−2, ..., Xt−k−1) iff p ≤ k.
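A small sketch of why p ≤ k is the right condition: for an AR(p) process, the stacked state Ut = [Xt−1, ..., Xt−k] evolves as a linear (companion-matrix) function of Ut−1 plus fresh noise only when the k lags cover all p coefficients; the coefficients below are illustrative:

```python
import numpy as np

def companion_matrix(phi, k):
    """Transition matrix F of U_t = [X_{t-1}, ..., X_{t-k}] for an AR(p) process,
    so that U_t = F @ U_{t-1} + noise * e_1. Valid only for k >= p = len(phi):
    otherwise X_{t-1} depends on lags that are not contained in U_{t-1}."""
    p = len(phi)
    if k < p:
        raise ValueError("U_t is not Markov: X_{t-1} depends on lags outside U_{t-1}")
    F = np.zeros((k, k))
    F[0, :p] = phi                 # X_{t-1} = sum_i phi_i X_{t-1-i} + noise
    F[1:, :-1] = np.eye(k - 1)     # the remaining entries are shifted copies
    return F

print(companion_matrix(phi=[0.6, -0.2], k=3))   # AR(2) embedded in a 3-lag state
```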


c) We consider a recurrent neural network which produces Xt. Does the sequence of variables Ut = Xt fulfill the Markov property, i.e. P(Ut | Ut−1) = P(Ut | Ut−1, ..., U0)? Justify your answer.


No. Given Xt−1, both Xt and Xt−2 still depend on the hidden state at time t − 2, so Xt is not conditionally independent of the earlier outputs given Xt−1.


Problem 5

Markov Chain (3 credits)

We consider a Markov chain Xt on {1, ..., C} with parameters π, A. We assume we observed the sequence
$$S_k = [\underbrace{v_0, \ldots, v_0}_{k \text{ times}}, \underbrace{v_1, \ldots, v_1}_{k \text{ times}}, \ldots, \underbrace{v_T, \ldots, v_T}_{k \text{ times}}]$$
where each value is observed k times. The parameter k can be seen as a discretization parameter of the time space.

Compute the likelihood of the sequence under the parameters π, A, i.e. Pπ,A(Sk). What happens to this quantity if you increase the discretization parameter from k to k′ > k but keep the same model parameters π, A?

We compute the likelihood:
$$P_{\pi,A}(S_k) = \pi_{v_0} \times \prod_{t=0}^{T-1} A_{v_t, v_{t+1}} \times \prod_{t=0}^{T} A_{v_t, v_t}^{\,k-1}.$$


Since the parameters Ai,j < 1, the likelihood decreases when k increases.
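A minimal sketch of this likelihood for a k-times-repeated sequence (the chain parameters below are arbitrary examples):

```python
import numpy as np

def log_likelihood_repeated(values, k, pi, A):
    """Log-likelihood of the sequence [v_0]*k + [v_1]*k + ... + [v_T]*k under a
    Markov chain with initial distribution pi and transition matrix A."""
    ll = np.log(pi[values[0]])
    ll += sum(np.log(A[u, v]) for u, v in zip(values[:-1], values[1:]))  # jumps v_t -> v_{t+1}
    ll += (k - 1) * sum(np.log(A[v, v]) for v in values)                 # k - 1 self-transitions per value
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
values = [0, 1, 0]                                       # v_0, v_1, v_2
print(log_likelihood_repeated(values, k=1, pi=pi, A=A))
print(log_likelihood_repeated(values, k=3, pi=pi, A=A))  # smaller: the likelihood decreases with k
```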


Problem 6

Temporal Point Process (6 credits)

Consider an inhomogeneous Poisson process (IPP) on the interval [0, 4] with the intensity function
$$\lambda(t) = \begin{cases} a & \text{if } t \in [0, 3] \\ b & \text{if } t \in (3, 4] \end{cases}$$
where a > 0, b > 0 are some positive parameters.

a) Assume that you observed a sequence of events {0.2, 1.0, 1.5, 2.9, 3.1, 3.8} generated by the above IPP. What is the maximum likelihood estimate of the parameters a and b?

The log-likelihood of a TPP realization {t1, ..., tN} is
$$\log p(\{t_1, \ldots, t_N\}) = \sum_{i=1}^{N} \log \lambda(t_i) - \int_0^T \lambda(t)\, dt.$$
Plugging in our values {0.2, 1.0, 1.5, 2.9, 3.1, 3.8}, we obtain
$$\log p = 4 \log a + 2 \log b - 3a - b.$$
To maximize w.r.t. a and b, we can compute the derivatives, set them to zero and solve for a and b:
$$\frac{\partial}{\partial a} \log p = \frac{4}{a} - 3 \overset{!}{=} 0 \;\Rightarrow\; a^\star = \frac{4}{3},$$
$$\frac{\partial}{\partial b} \log p = \frac{2}{b} - 1 \overset{!}{=} 0 \;\Rightarrow\; b^\star = 2.$$

b) Assume that a = 1 and b = 5. What is the expected number of events generated by the IPP in this case?

We know from the properties of the Poisson process that the expected number of events equals
$$\int_0^T \lambda(t)\, dt = \int_0^4 \lambda(t)\, dt = \int_0^3 1\, dt + \int_3^4 5\, dt = 3 + 5 = 8.$$
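A quick sketch of both computations: for a piecewise-constant intensity, the MLE on each piece is the event count divided by the piece length, and the expected number of events is the integral of the intensity (values as in the problem):

```python
import numpy as np

events = np.array([0.2, 1.0, 1.5, 2.9, 3.1, 3.8])

# Part a): piecewise-constant intensity, a on [0, 3] and b on (3, 4].
n_a = np.sum(events <= 3.0)      # 4 events in [0, 3]
n_b = np.sum(events > 3.0)       # 2 events in (3, 4]
a_hat = n_a / 3.0                # events / interval length = 4/3
b_hat = n_b / 1.0                # = 2
print(a_hat, b_hat)

# Part b): expected number of events for a = 1, b = 5 is the integral of the intensity.
print(1.0 * 3.0 + 5.0 * 1.0)     # = 8
```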


Problem 7

Clustering with the Planted Partition Model (4 credits)

The following graph has been generated from a planted partition model with in-community edge probability p and between-community edge probability q.

[Figure: the generated graph on nodes 1–10.]

Assuming p < q, find the maximum likelihood community assignments under a PPM. Give your solution as two sets of node labels making up the two discovered communities. Justify your answer.

The likelihood of the graph under a planted partition model is proportional to q(1−p)/(p(1−q)) raised to the power of the induced cut size of the partitioning. If p < q, this fraction is greater than 1 and the likelihood is maximized by the maximum cut size. Therefore, the two clusters are made up of the nodes {2, 3, 7, 8} and {1, 4, 5, 6, 9, 10}.
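For illustration, a small sketch that scores candidate partitions by the full PPM log-likelihood; the edge list below is a hypothetical toy graph, since the exam's figure is not reproduced here:

```python
import numpy as np
from itertools import combinations

def ppm_log_likelihood(edges, nodes, communities, p, q):
    """Log-likelihood of an undirected graph under a planted partition model.
    edges: iterable of node pairs; communities: dict mapping node -> community label."""
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u, v in combinations(nodes, 2):
        prob = p if communities[u] == communities[v] else q
        ll += np.log(prob) if frozenset((u, v)) in edge_set else np.log(1.0 - prob)
    return ll

# Hypothetical 4-node example with a bipartite-like structure.
nodes = [1, 2, 3, 4]
edges = [(1, 3), (1, 4), (2, 3), (2, 4)]
split_max_cut = {1: 0, 2: 0, 3: 1, 4: 1}     # cuts all 4 edges
split_small_cut = {1: 0, 3: 0, 2: 1, 4: 1}   # cuts only 2 edges
p, q = 0.1, 0.9                              # p < q favours the maximum cut
print(ppm_log_likelihood(edges, nodes, split_max_cut, p, q))    # larger
print(ppm_log_likelihood(edges, nodes, split_small_cut, p, q))
```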


Problem 8

PageRank in a Wheel (6 credits)

[Figure 8.1: Example of a directed wheel graph with n + 1 = 9 nodes.]

Consider a directed graph of size n + 1 with a cycle of n nodes and an additional central node that every other node connects to (see figure). So we have a graph with node set V = {0, 1, ..., n} and edge set
$$E = \{(i, i+1) \mid i \in \{1, \ldots, n-1\}\} \cup \{(n, 1)\} \cup \{(i, 0) \mid i \in \{1, \ldots, n\}\}.$$
We want to compute the PageRank scores with a link-follow probability of β (a teleport probability of 1 − β) and some arbitrary teleport vector π with $\sum_{i=0}^{n} \pi_i = 1$. Note that we index π from 0 to n. We define the predecessor function pa as the index of the predecessor of a node in the directed cycle, i.e.
$$\mathrm{pa}(1) = n \quad\text{and}\quad \mathrm{pa}(i) = i - 1 \;\; \forall i \in \{2, \ldots, n\}.$$
You can write pa^k(i) for the k-th predecessor of node i, i.e. pa³(i) = pa(pa(pa(i))) and pa⁰(i) = i.

a) Set up the PageRank equations for all nodes in scalar form, i.e. each ri separately instead of matrix form.

The nodes 1 through n have exactly 2 outgoing edges, so their degree is 2 and their PageRank equations are
$$r_i = \beta \cdot \frac{r_{\mathrm{pa}(i)}}{2} + (1 - \beta)\, \pi_i \qquad \forall i \in \{1, \ldots, n\}.$$
The central node has incoming edges from all other nodes and therefore the following PageRank equation:
$$r_0 = \beta \cdot \left( \sum_{i=1}^{n} \frac{r_i}{2} \right) + (1 - \beta)\, \pi_0.$$

b) Why is this graph problematic for PageRank without random teleportation (β = 1)?

The central node has no outgoing edges and is a dead end. A random walker would be stuck there and any PageRank would be "lost".


c) Show that the PageRank for node i ∈ {1, ..., n} in the outer cycle is given by
$$r_i = \frac{1 - \beta}{1 - \left(\frac{\beta}{2}\right)^{n}} \sum_{j=0}^{n-1} \left(\frac{\beta}{2}\right)^{j} \pi_{\mathrm{pa}^{j}(i)}.$$

In the formula of some cycle node i, we can plug in the formula for the predecessor nodes repeatedly and get
$$\begin{aligned}
r_i &= \frac{\beta}{2}\, r_{\mathrm{pa}(i)} + (1-\beta)\pi_i \\
&= \frac{\beta}{2}\left( \frac{\beta}{2}\, r_{\mathrm{pa}^{2}(i)} + (1-\beta)\pi_{\mathrm{pa}(i)} \right) + (1-\beta)\pi_i \\
&= \left(\frac{\beta}{2}\right)^{2} r_{\mathrm{pa}^{2}(i)} + \frac{\beta}{2}(1-\beta)\pi_{\mathrm{pa}(i)} + (1-\beta)\pi_i \\
&= \left(\frac{\beta}{2}\right)^{3} r_{\mathrm{pa}^{3}(i)} + \left(\frac{\beta}{2}\right)^{2} (1-\beta)\pi_{\mathrm{pa}^{2}(i)} + \frac{\beta}{2}(1-\beta)\pi_{\mathrm{pa}(i)} + (1-\beta)\pi_i \\
&= \ldots
\end{aligned}$$
Repeating this process for k steps expresses the rank of node i in terms of its k-th predecessor:
$$r_i = \left(\frac{\beta}{2}\right)^{k} r_{\mathrm{pa}^{k}(i)} + (1-\beta) \sum_{j=0}^{k-1} \left(\frac{\beta}{2}\right)^{j} \pi_{\mathrm{pa}^{j}(i)}.$$
But node i is its own n-th predecessor, so we get
$$r_i = \left(\frac{\beta}{2}\right)^{n} r_i + (1-\beta) \sum_{j=0}^{n-1} \left(\frac{\beta}{2}\right)^{j} \pi_{\mathrm{pa}^{j}(i)}
\quad\Longleftrightarrow\quad
r_i = \frac{1-\beta}{1 - \left(\frac{\beta}{2}\right)^{n}} \sum_{j=0}^{n-1} \left(\frac{\beta}{2}\right)^{j} \pi_{\mathrm{pa}^{j}(i)}.$$
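A small numeric check of this closed form against fixed-point iteration of the scalar PageRank equations from part a); n, β and π below are arbitrary choices:

```python
import numpy as np

def wheel_pagerank(n, beta, pi, n_iter=10_000):
    """Fixed-point iteration of the scalar PageRank equations for the wheel graph.
    pi has length n + 1: pi[0] is the central node, pi[1..n] are the cycle nodes."""
    pa = lambda i: n if i == 1 else i - 1            # predecessor in the directed cycle
    r = np.full(n + 1, 1.0 / (n + 1))
    for _ in range(n_iter):
        new = np.empty_like(r)
        new[0] = beta * r[1:].sum() / 2 + (1 - beta) * pi[0]
        for i in range(1, n + 1):
            new[i] = beta * r[pa(i)] / 2 + (1 - beta) * pi[i]
        r = new
    return r

def closed_form_cycle(n, beta, pi, i):
    """Closed-form rank of cycle node i from part c)."""
    pa = lambda j: n if j == 1 else j - 1
    node, total = i, 0.0
    for j in range(n):
        total += (beta / 2) ** j * pi[node]          # (beta/2)^j * pi_{pa^j(i)}
        node = pa(node)
    return (1 - beta) / (1 - (beta / 2) ** n) * total

n, beta = 8, 0.85
rng = np.random.default_rng(0)
pi = rng.random(n + 1)
pi /= pi.sum()
r = wheel_pagerank(n, beta, pi)
print(np.allclose([closed_form_cycle(n, beta, pi, i) for i in range(1, n + 1)], r[1:]))
```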


Problem 9

Label Propagation (4 credits)

Consider the following graph

[Figure: a chain graph. The observed node labeled 0 is connected to the unobserved node y with weight a, y is connected to the unobserved node x with weight a, and x is connected to the observed node labeled 1 with weight 5.]

The nodes labeled 0 and 1 are observed and from class 0 and 1, respectively. One edge has a fixed weight, the other two have a variable edge weight of a ≥ 0. The two center nodes are unobserved and we call their labels x and y. We want to predict classes for the two center nodes that minimize the Label Propagation objective exactly,
$$\frac{1}{2} \sum_{ij} W_{ij} \left( y_i - y_j \right)^2,$$
where W is the weighted adjacency matrix and yi, yj are the labels of the nodes.

Find the set of all possible edge weights a that guarantee that node x is assigned to class 0. Justify your answer.

If we instantiate the objective for this graph (disregarding constant factors), we get
$$5(x-1)^2 + a(x-y)^2 + a(y-0)^2 = (5+a)x^2 - 10x + 5 - 2axy + 2ay^2.$$

Since x and y are restricted to be either 0 or 1, we can just look at all cases:

• x = 0, y = 0 ⇒ 5
• x = 0, y = 1 ⇒ 5 + 2a
• x = 1, y = 0 ⇒ a
• x = 1, y = 1 ⇒ a


If x = 0, setting y = 1 always increases the cost, so the assignment x = 0, y = 1 will never be made. In the case of x = 1, the value of y is irrelevant. In conclusion, we assign x = 0 if 5 < a, so (5, ∞) is the set we are looking for.

Alternatively: In Label Propagation with two classes, the exact solution is given by the minimum cut between the sets of labeled nodes. For two unlabeled nodes, there are 4 possible cuts corresponding to the 4 possible assignments of x and y above. By the same reasoning as above, we arrive at the same solution.
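A quick brute-force check of the case analysis above over the four possible assignments (the values of a are arbitrary examples on either side of the threshold):

```python
def lp_objective(x, y, a):
    """Label Propagation objective for the chain 1 --5-- x --a-- y --a-- 0."""
    return 5 * (x - 1) ** 2 + a * (x - y) ** 2 + a * (y - 0) ** 2

for a in (4.0, 6.0):                  # below and above the threshold a = 5
    cost, x, y = min((lp_objective(x, y, a), x, y) for x in (0, 1) for y in (0, 1))
    print(f"a = {a}: optimal assignment x = {x}, y = {y} with cost {cost}")
# x is assigned class 0 exactly when a > 5.
```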


Problem 10

Adversarial Attacks on Graph Neural Networks (2 credits)

Suppose you are given the following two-layer graph neural network:
$$f(A, X) = Z = \operatorname{Softmax}\!\left( \hat{A}\, \operatorname{ReLU}\!\left( \hat{A} X W_1 \right) W_2 \right)$$

X ∈ R^{N×D} are the node features, Z are the node predictions, W1 and W2 are weight matrices of appropriate dimensions, and Â = D̃^{−1/2} Ã D̃^{−1/2} is the propagation matrix as defined for GCNs. Here, Ã = A + I, where A is the adjacency matrix and I is the identity matrix, and D̃ is a diagonal matrix of node degrees, D̃ii = Σj Ãij.

The model was trained for the task of semi-supervised node classification, and we want to predict a class c for node 6 in the following graph A:

[Figure: the graph A on nodes 1–7; node 6 is the node to be classified.]
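For illustration, a minimal sketch of the two-layer GCN forward pass defined above; the adjacency matrix, features and weights below are arbitrary placeholders, not the exam's graph:

```python
import numpy as np

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN: Z = softmax(A_hat @ relu(A_hat @ X @ W1) @ W2), row-wise softmax."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))    # D~^{-1/2} A~ D~^{-1/2}
    H = np.maximum(A_hat @ X @ W1, 0.0)          # ReLU
    logits = A_hat @ H @ W2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)           # toy 3-node graph
X = rng.random((3, 4))
Z = gcn_forward(A, X, rng.random((4, 8)), rng.random((8, 2)))
print(Z)                                         # per-node class probabilities
```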

a) An advers...

