Homework 5 PDF

Title	Homework 5
Author	Ashish Kumar
Course	Introduction to Data Science
Institution	University of Maryland Baltimore County
Pages	4
File Size	164.4 KB
File Type	PDF
Total Downloads	23
Total Views	129

Preview

CLICK TO PREVIEW PDF

Summary

K-Means, BIRCh...

Description

Homework 5 10.2 Suppose that the data mining task is to cluster points (with (x, y) representing location) into three clusters, where the points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose initially we assign A 1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only (a) The three cluster centers after the first round of execution. The three clusters are A 1, B1, and C1, so calculating the Euclidean distance between of each point from all the three clusters. First Iteration: A1 Centroid:1(A1 0 ) Centroid:2(B1) 3.6 Centroid:3(C1 8.06 )

A2 5

A3 8.48

B1 3.6

B2 7.07

B3 7.21

C1 8.06

C2 2.23

4.24 3.16

5 7.28

0 7.21

3.6 6.7

4.12 5.38

7.21 0

1.41 7.61

The three clusters with cluster points are: Cluster 1 = {A1(2,10)} Cluster 2 = {A3(8,4), B1(5,8), B2(7,5), B3(6,4), C2(4,9)} Cluster 3= {A2(2,5), C1(4,9)} Calculating the center(centroid) after the first round: Center1 = (2,10) Center2 = {(5+8+7+6+4)/5, (8+4+5+4+9)/5} = (6,6) Center3 = (1.5, 3.5)

(b) The final three clusters. Second iteration: A1 A2 Centroid:1(A1 0 5 ) Centroid:2(B1) 4.12 4.12 Centroid:3(C1 6.51 1.58 )

A3 8.48

B1 3.6

B2 7.07

B3 7.21

C1 8.06

C2 2.23

2.82 6.51

2.23 5.7

1.41 5.7

2 4.52

6.4 1.58

3.6 6.04

After the third iteration the final clusters are: Cluster 1: {(A1, C2, B1} Cluster 2: {(A3, B2, B3)}

Cluster 3: {(A2, C1)}

10.10 Why is it that BIRCH encounters difficulties in finding clusters of arbitrary shape but OPTICS does not? Propose modifications to BIRCH to help it find clusters of arbitrary shape. Ans: The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) encounters difficulties in finding clusters of arbitrary size because it uses Euclidean distance and intercluster proximity which forms the spherical shaped structures and in case of birch if the structure is not spherical it cannot achieve a good clustering effect. One the other hand OPTICS is a density-based clustering method which form dense regions based on the connected points within a defined radius and can form clusters of arbitrary sizes like ‘S’ shape or oval shape. We can modify BIRCH to use connectivity-based and density-based distance measure instead of Euclidean distance at low levels of the method while forming the CF-tree, this can help in making arbitrary shape clusters.

10.12 Present conditions under which density-based clustering is more suitable than partitioningbased clustering and hierarchical clustering. Give application examples to support your argument. Ans: Partitioning based clustering and hierarchical clustering uses Euclidean distance to measure distance and forms clusters which can only make clusters of spherical shape and encounters difficulties in making clusters of arbitrary size. Partitioning methods also have difficulties in finding clusters of different densities and diameter. There are two major flaws of hierarchical methods, the first one is they are not scalable and the second one is they don’t have the ability to undo what was done in the previous step which makes density based algorithm more suitable than others. An example of partitioning based clustering based is:

While on the other hand density-based methods like DBSCAN, OPTICS, DENCLUE can form clusters of arbitrary shape and can handle noise in the data. Each cluster are collection of points with high density compared to the points outside the clusters.

An example of density-based clustering is:

10.18 Suppose that you are to allocate a number of automatic teller machines (ATMs) in a given region so as to satisfy a number of constraints. Households or workplaces may be clustered so that typically one ATM is assigned per cluster. The clustering, however, may be constrained by two factors: (1) obstacle objects (i.e., there are bridges, rivers, and highways that can affect ATM accessibility), and (2) additional user-specified constraints such as that each ATM should serve at least 10,000 households. How can a clustering algorithm such as k-means be modified for quality clustering under both constraints? Ans: Few modifications that can be done in K-means satisfy both the condition are: 

The first major change should be the change in the distance when obstacles like bridges, rivers, and highways are found. The distance should be changed to actual distance when obstacles are encountered instead of the Euclidean distance.



Since K-means uses the centroid value, outliers can badly affect the place of centroid. Instead of considering the centroid i.e. that is the mean value of the objects in the cluster, we can use an actual object to represent the cluster which is also known as K-medoids. This will help in keeping the boundary condition of at least 10,000 households are served by each ATM and restricts on obstacle objects.



Instead of using large clusters, objects can be kept in small micro clusters and advantage of doing that would be obstacles can be found between the micro clusters and it won’t affect the accessibility of ATM’s to the people....