HW3 2020 CS246 Solutions PDF

Title	HW3 2020 CS246 Solutions
Author	Paul Caron
Course	Mining Massive Datasets
Institution	Stanford University
Pages	9
File Size	154.1 KB
File Type	PDF
Total Downloads	9
Total Views	159

Preview

CLICK TO PREVIEW PDF

Summary

hw3solution...

Description

CS246: Mining Massive Data Sets

Winter 2020

Problem Set 3 Please read the homework submission policies at http://cs246.stanford.edu.

1

Dead ends in PageRank computations (25 points)

Let the matrix of the Web M be an n-by-n matrix, where n is the number of Web pages. The entry mij in row i and column j is 0, unless there is an arc from node (page) j to node i. In that case, the value of mij is 1/k, where k is the number of arcs (links) out of node j . Notice that if node j has k > 0 arcs out, then column j has k values of 1/k and the rest 0’s. If node j is a dead end (i.e., it has zero arcs out), then column j is all 0’s. Let r = [r1 , r2 , . . . , rn ]T be (an estimate of ) the PageRank vector; that is, ri is the estimate of the PageRank of node i. Define w(r) to be the sum of the components of r; that is P w(r) = ni=1 ri . ′ In one iteration of the PageRank algorithm, we compute the Pn next estimate r of the′ PageRank ′ ′ M r . Define w(r ) to be the as: r = Mr. Specifically, for each i we compute Pn ′ ri = j=1 ij j ′ ′ sum of components of r ; that is w(r ) = i=1 ri .

You may use D (the set of dead nodes) in your equation. (a) [6pts]

Suppose the Web has no dead ends. Prove that w(r′ ) = w(r). Pn Pn Pn ⋆ SOLUTION: We need to show i=1 order of sumM r = ij j i=1 ri . Interchange P j=1 P Pn n n mations so we have j=1 ( i=1 Mij )rj on the left. Since there are no dead ends, i=1 Mij = 1 for each j. Thus the equation holds.

(b) [9pts] Suppose there are still no dead ends, but we use a teleportation probability of 1−β where we teleportP to a random node (0 < β < 1). The expression for the next estimate of ri becomes n ri′ = β( j=1 Mij rj ) + (1 − β)/n. Under what circumstances will w(r′ ) = w(r)? Prove your conclusion.

CS 246: Mining Massive Data Sets - Problem Set 3

2

Pn Pn ⋆ SOLUTION: If and only if w(r) = 1. To see why, sum over I to get w(r′ ) = i=1 Mij rj + β j=1 P n right sums to 1 − β, and using the reasoning from part i=1 (1 − β)/n. The second term on theP (a), the first term on the right sums to β nj=1 rj . That is: w(r′ ) = βw(r) + 1 − β. If w(r) = 1, then the equation tells us w(r′ ) is also 1. Conversely, if w(r′ ) = w(r) = x, then x must satisfy the equation x = βx + 1 − β, from which it follows that x = 1.

(c) [10pts] Now let us assume there are one or more dead ends. Call a node “dead” if it is a dead end and “live” if not. At each iteration, we teleport from live nodes with probability 1 − β and teleport from dead nodes with probability 1. In both cases, we choose a random node uniformly to teleport to. Assume w(r) = 1. Write the equation for ri′ in terms of β, M, r, n, and D (where D is the set of dead nodes). Then, prove that w(r′ ) is also 1. Pn P ⋆ SOLUTION: The equation is ri′ = β j=1 Mij rj + (1 − β)/n + (β/n) deadj rj If we sum over i P and use the same trickPof moving the summation on i to apply only to Mij , we get w(r′ ) = β livej rj + 1 − β + β deadj rj . The first and last terms on the right together give βw(r), which is β, since w(r) is assumed to be 1. Thus, the right side reduces to 1, as desired.

What to submit (i) Proof [1(a)] (ii) Condition for w(r′ ) = w(r) and Proof [1(b)] (iii) Equation for ri′ and Proof [1(c)]

2

Implementing PageRank and HITS (30 points)

In this problem, you will learn how to implement the PageRank and HITS algorithms in Spark. The general computation should be done in Spark, and you may also include numpy operations whenever needed. You will be experimenting with a small randomly generated graph (assume graph has no dead-ends) provided at graph-full.txt. There are 100 nodes (n = 100) in the small graph and 1000 nodes (n = 1000) in the full graph, and m = 8192 edges, 1000 of which form a directed cycle (through all the nodes) which ensures that the graph is connected. It is easy to see that the existence of such a cycle ensures that there are no dead ends in the graph. There may be multiple directed edges between a pair of nodes, and your solution should treat them as the same edge. The first

CS 246: Mining Massive Data Sets - Problem Set 3

3

column in graph-full.txt refers to the source node, and the second column refers to the destination node. Implementation hint: You may choose to store the PageRank vector r either in memory or as an RDD. Only the matrix M of links is too large to store in memory, and you are allowed to store matrix M in an RDD. e.g. data = sc.textFile(”graph−full.txt”). On an actual cluster, an RDD is partitioned across the nodes of the cluster. However, you cannot then M = data.collect() which fetches the entire RDD to a single machine at the driver node and stores it as an array locally. (a) PageRank Implementation [15 points] Assume the directed graph G = (V, E) has n nodes (numbered 1, 2, . . . , n) and m edges, all nodes have positive out-degree, and M = [Mji ]n×n is a an n × n matrix as defined in class such that for any i, j ∈ J1, nK:  1 if (i → j) ∈ E, deg(i) Mji = 0 otherwise. Here, deg(i) is the number of outgoing edges of node i in G. If there are multiple edges in the same direction between two nodes, treat them as a single edge. By the definition of PageRank, assuming 1 − β to be the teleport probability, and denoting the PageRank vector by the column vector r, we have the following equation: r=

1−β 1 + βMr, n

(1)

Based on this equation, the iterative procedure to compute PageRank works as follows: 1. Initialize: r(0) = n1 1 2. For i from 1 to k, iterate: r(i) =

1−β 1 n

+ βMr(i−1)

Run the aforementioned iterative process in Spark for 40 iterations (assuming β = 0.8) and obtain the PageRank vector r. In particular, you don’t have to implement the blocking algorithm from lecture. The matrix M can be large and should be processed as an RDD in your solution. Compute the PageRank scores and report the node id for the following using graph-full.txt: • List the top 5 node ids with the highest PageRank scores. • List the bottom 5 node ids with the lowest PageRank scores. For a sanity check, we have provided a smaller dataset (graph-small.txt). In that dataset, the top node has id 53 with value 0.036. Note that the graph-small.txt dataset is only provided for sanity check purpose. Your write-up should include results obtained using graph-full.txt (for both part (a) and (b)).

CS 246: Mining Massive Data Sets - Problem Set 3

4

⋆ SOLUTION: Top 1st node id = 263 Top 2nd node id = 537 Top 3rd node id = 965 Top 4th node id = 243 Top 5th node id = 285 Bottom Bottom Bottom Bottom Bottom

1st node id = 558 2nd node id = 93 3rd node id = 62 4th node id = 424 5th node id = 408

(b) HITS Implementation [15 points] Assume the directed graph G = (V, E) has n nodes (numbered 1, 2, . . . , n) and m edges, all nodes have non-negative out-degree, and L = [Lij ]n×n is a an n × n matrix referred to as the link matrix such that for any i, j ∈ J1, nK:  1 if (i → j) ∈ E, Lij = 0 otherwise. Given the link matrix L and some scaling factors λ, µ, the hubbiness vector h and the authority vector a can be expressed using the equations: h = λLa, a = µLT h

(2)

where 1 is the n × 1 vector with all entries equal to 1. Based on this equation, the iterative method to compute h and a is as follows: 1. Initialize h with a column vector (of size n × 1) of all 1’s. 2. Compute a = LT h and scale so that the largest value in the vector a has value 1. 3. Compute h = La and scale so that the largest value in the vector h has value 1. 4. Go to step 2. Repeat the iterative process for 40 iterations, assume that λ = 1, µ = 1 and then obtain the hubbiness and authority scores of all the nodes (pages). The link matrix L can be large and should be processed as an RDD. Compute the following using graph-full.txt: • List the 5 node ids with the highest hubbiness score.

CS 246: Mining Massive Data Sets - Problem Set 3

5

• List the 5 node ids with the lowest hubbiness score. • List the 5 node ids with the highest authority score. • List the 5 node ids with the lowest authority score. For a sanity check, you should confirm that graph-small.txt has highest hubbiness node id 59 with value 1 and highest authority node id 66 with value 1. ⋆ SOLUTION: Top 1th hubbiness node id = 840 Top 2th hubbiness node id = 155 Top 3th hubbiness node id = 234 Top 4th hubbiness node id = 389 Top 5th hubbiness node id = 472 Bottom Bottom Bottom Bottom Bottom

1th 2th 3th 4th 5th

Top Top Top Top Top

authority authority authority authority authority

1th 2th 3th 4th 5th

Bottom Bottom Bottom Bottom Bottom

1th 2th 3th 4th 5th

hubbiness hubbiness hubbiness hubbiness hubbiness

node node node node node

node node node node node

authority authority authority authority authority

id id id id id

= = = = =

node node node node node

id id id id id

= = = = =

23 835 141 539 889

893 16 799 146 473

id id id id id

= = = = =

19 135 462 24 910

What to submit (i) List 5 node ids with the highest and least PageRank scores [2(a)] using graph-full.txt (ii) List 5 node ids with the highest and least hubbiness and authority scores [2(b)] using graph-full.txt (iii) Upload all the code via Gradescope [2(a) & 2(b)]

CS 246: Mining Massive Data Sets - Problem Set 3

3

6

Clique-Based Communities (25 points)

Imagine an undirected graph G with nodes 2, 3, 4, . . . , 1000000. (Note that there is no node 1.) There is an edge between nodes i and j if and only if i and j have a common factor other than 1. Put another way, the only edges that are missing are those between nodes that are relatively prime; e.g., there is no edge between 15 and 56. We want to find communities by starting with a clique (not a bi-clique) and growing it by adding nodes. However, when we grow a clique, we want to keep the density of edges at 1; i.e., the set of nodes remains a clique at all times. A maximal clique is a clique for which it is impossible to add a node and still retain the property of being a clique; i.e., a clique C is maximal if every node not in C is missing an edge to at least one member of C . Let Ci be the set of nodes of G that are divisible by i, where i is a positive integer. (a) [5 points] Prove that Ci is a clique for any i. ⋆ SOLUTION: If there are fewer than two nodes in Ci , then Ci is a clique by definition. Otherwise, any two nodes in Ci have i as a common factor and therefore have an edge between them.

(b) [10 points] Under what circumstances is Ci a maximal clique? Prove that your conditions are both necessary and sufficient. (Trivial conditions, like “Ci is a maximal clique if and only if Ci is a maximal clique,” will receive no credit.) ⋆ SOLUTION: i must be a prime less than or equal to one million. If i is greater than one million, then Ci is the empty clique, and adding any node to Ci will produce a 1-clique. Thus Ci is not maximal. If i is less than or equal to one million, but isn’t prime, let j be a factor of i, with 1 < j < i. Node j is not in Ci , yet it has an edge to every member of Ci , because it has j as a common factor. Thus, Ci is not maximal. Conversely, if i is prime, and less than or equal to one million, then there is no node outside Ci that has an edge to the node i itself. To see why, suppose j were such a node. Then i and j have a common factor other than 1, which can only be i, since i is prime. However, if j has i as a factor, then j is a multiple of i and therefore already in Ci .

CS 246: Mining Massive Data Sets - Problem Set 3

7

(c) [10 points] Prove that C2 is the unique largest clique. That is, it has more elements than any other clique. (Note: Not all cliques are in the form of Ci ) ⋆ SOLUTION: i and i + 1 are always relatively prime. This is because if they had a common factor p > 1, then (i + 1) − i = 1 must also be divisible by p, which is impossible for any p > 1. Since i and i + 1 are always relatively prime, only one of {2, 3}, {4, 5}, {6, 7}, . . . can be in a clique. Therefore, the largest possible clique has 500, 000 members, as does C2 . Conversely, there are no cliques other than C2 that have 500, 000 members. Any clique with 500, 000 members must contain either 2 or 3, because otherwise (by the previous argument) the clique must contain at most 499, 999 members. If it contains 3, then all other elements in the clique must have 3 as a factor (since 3 is prime), and there are fewer than 499, 999 elements that are divisible by 3. Note that any solution that implicitly assumes that all cliques are of the form Ci is incorrect. As a simple example, consider the clique (2 ∗ 3, 2 ∗ 5, 3 ∗ 5); there is an edge between any two of the nodes, but there is no number that divides all three of the nodes. An example of an incorrect answer is “by part b, all maximal cliques must be of the form Ci , where i is prime; since C2 is larger than C3 , C5 , etc., it is the unique largest clique.”

What to submit (i) Proof that the specified nodes are a clique. (ii) Necessary and sufficient conditions for Ci to be a maximal clique, with proof. (iii) Proof that C2 is the unique largest clique.

4

Dense Communities in Networks (20 points)

In this problem, we study the problem of finding dense communities in networks. Definitions: Assume G = (V, E) is an undirected graph (e.g., representing a social network). • For any subset S ⊆ V , we let the induced edge set (denoted by E[S]) to be the set of edges both of whose endpoints belong to S . • For any v ∈ S, we let degS (v) = |{u ∈ S|(u, v) ∈ E}|.

CS 246: Mining Massive Data Sets - Problem Set 3

8

• Then, we define the density of S to be: ρ(S) =

|E[S ]| . |S|

• Finally, the maximum density of the graph G is the density of the densest induced subgraph of G, defined as: ρ∗ (G) = max{ρ(S)}. S⊆V

Goal. Our goal is to find an induced subgraph of G whose density is not much smaller than ρ∗ (G). Such a set is very densely connected, and hence may indicate a community in the network represented by G. Also, since the graphs of interest are usually very large in practice, we would like the algorithm to be highly scalable. We consider the following algorithm: Require: G = (V, E) and ǫ > 0 ˜ S←V S, while S 6= ∅ do A(S) := {i ∈ S | degS (i) ≤ 2(1 + ǫ)ρ(S)} S ← S \ A(S) ˜ then if ρ(S) > ρ(S) S˜ ← S end if end while return S˜ The basic idea in the algorithm is that the nodes with low degrees do not contribute much to the density of a dense subgraph, hence they can be removed without significantly influencing the density. We analyze the quality and performance of this algorithm. We start with analyzing its performance. (a) [10 points] We show through the following steps that the algorithm terminates in a logarithmic number of steps. ǫ |S|. i. Prove that at any iteration of the algorithm, |A(S)| ≥ 1+ǫ

ii. Prove that the algorithm terminates in O(log1+ǫ (n)) iterations, where n is the initial number of nodes.

CS 246: Mining Massive Data Sets - Problem Set 3

9

P P ⋆ SOLUTION: We have: 2|E[S]| = v∈S degS (v) ≥ v∈S\A(S) degS (v) > |S \A(S )|2(1+ ǫ |S |. At each iteration, the ǫ)ρ(S ). Hence, |S \ A(S )| < |S|/(1 + ǫ), or equivalently |A(S)| > 1+ǫ size of S goes down by a factor larger than 1 + ǫ. Starting with n nodes, in at most O(log1+ǫ (n)) iterations the set gets depleted.

(b) [10 points] We show through the following steps that the density of the set returned by the algorithm is at most a factor 2(1 + ǫ) smaller than ρ∗ (G). i. Assume S ∗ is the densest subgraph of G. Prove that for any v ∈ S ∗ , we have: degS ∗ (v) ≥ ρ∗ (G). ii. Consider the first iteration of the while loop in which there exists a node v ∈ S ∗ ∩ A(S). Prove that 2(1 + ǫ)ρ(S) ≥ ρ∗ (G). ˜ ≥ iii. Conclude that ρ(S)

1 ρ∗ (G). 2(1+ǫ)

⋆ SOLUTION: If for some v ∈ S ∗ , we have degS∗ (v) < ρ∗ (G), then the set S ∗ \ {v } |E[S ∗ ]|−degS ∗ > ρ(S ∗ ), which contradicts the assumption of S ∗ being the densest has density |S ∗ |−1 subgraph. If there is a first iteration where there exists a node v ∈ S ∗ ∩ A(S), we have: 2(1 + ǫ)ρ(S) ≥ degS ∗ (v) ≥ ρ∗ (G), hence 2(1 + ǫ)ρ(S) ≥ ρ∗ (G). Finally, notice that since we proved in the last part of the problem that the set S eventually gets depleted, there must be an iteration in which nodes from S ∗ start to get removed from S. We proved at that iteration 1 ˜ ≥ ρ(S) finishes the proof. ρ∗ (G). Noticing ρ(S) ρ(S) ≥ 2(1+ǫ)

What to submit (a)

i. Proof of |A(S)| ≥

ǫ |S |. 1+ǫ

ii. Proof of number of iterations for algorithm to terminate. (b)

i. Proof of degS ∗ (v) ≥ ρ∗ (G). ii. Proof of 2(1 + ǫ)ρ(S) ≥ ρ∗ (G). ˜ ≥ 1 ρ∗ (G). iii. Conclude that ρ(S) 2(1+ǫ)...