k-Means Clustering

Author: Grace Jiang
Course: Big Data Analytics
Institution: University of Pennsylvania


Clustering, 3/25: https://colab.research.google.com/drive/1IwLI6jBEd6KVqXyNGMocGN8zHHyT7qx1#scrollTo=aW2LPdSpfZUx

Distance measure: how far apart 2 instances are. E.g.: the Euclidean distance between 2 points on a Euclidean plane, the number of overlapping words between 2 documents, the similarity of market segments between 2 groups of people.
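As a small illustration (added here, not in the original notes), the Euclidean distance between two points can be computed with numpy:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((a - b) ** 2))   # -> 5.0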

k-Means Basic Clustering
1. Choose k random points as centroids
2. Assign each point to the centroid closest to it
3. Recenter each centroid at the mean of its assigned points
4. Continue until no point changes cluster (stable clusters), i.e., stop at convergence (every cluster label remains the same)

import numpy as np

def kmeans(X, k):
    # Steps 1-2: pick k random points from X as the starting centroids
    centroids = np.zeros((k, 2))
    cluster_assignments = [0 for i in range(len(X))]
    for i in range(0, k):
        centroids[i] = X[np.random.randint(0, X.shape[0])]

    # Steps 3-4: iterate until no point changes cluster
    changed = True
    while changed:
        changed = False
        # Assign points to clusters
        for i, x in enumerate(X):
            nearest = get_nearest(centroids, x)
            # We changed a cluster mapping!
            if nearest != cluster_assignments[i]:
                changed = True
                cluster_assignments[i] = nearest
        # Recompute centroids
        for i in range(len(centroids)):
            points = [j for j, v in enumerate(cluster_assignments) if v == i]
            X_subset = np.array([[X[j, 0], X[j, 1]] for j in points])
            centroids[i][0] = np.sum(X_subset[:, 0]) / len(points)
            centroids[i][1] = np.sum(X_subset[:, 1]) / len(points)
    return (centroids, np.array(cluster_assignments))
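The loop above calls a get_nearest helper that the notes don't define; a minimal sketch, assuming squared Euclidean distance:

def get_nearest(centroids, x):
    # Index of the centroid closest to point x (squared Euclidean distance)
    dists = [np.sum((c - x) ** 2) for c in centroids]
    return int(np.argmin(dists))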

Or use the pre-implemented version from scikit-learn:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, init='random', n_init=1, max_iter=300, random_state=0)
km.fit(X_embedded)
km.cluster_centers_   # coordinates of the learned centroids
km.labels_            # cluster assignment for each point

Distortion: The Most Common Measure of k-Means Clustering Quality
Minimize the within-cluster sum of squared errors (SSE), also known as the distortion function:

    SSE = Σ_{j=1..k} Σ_{x ∈ C_j} ||x - μ_j||²

where C_j is the set of points assigned to cluster j and μ_j is its centroid.
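A minimal sketch of computing the distortion, assuming the X, centroids, and cluster_assignments arrays from the code above:

def distortion(X, centroids, cluster_assignments):
    # Sum of squared distances from each point to its assigned centroid
    sse = 0.0
    for i, x in enumerate(X):
        c = centroids[cluster_assignments[i]]
        sse += np.sum((x - c) ** 2)
    return sse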

k-Means for Big Data
1. How to scale k-means from a single machine to big data?
2. How to identify the "right" number of clusters, k?
3. Are there tricks to speed up convergence?


Scaling: SQL / SparkSQL / Spark MLlib

Pseudocode

while not (stopping condition):
    for x in {x_1, ..., x_n}:
        assign x to the closest centroid C_j
    for j in range(k):
        C_j = mean of points assigned to cluster j

SparkSQL

km_data = pd.DataFrame(pd.read_csv('https://www.cis.upenn.edu/~zives/collegedata.csv').reset_index())

%%spark
max_iter = 3
initialize(k)        # k = desired number of clusters
for i in range(max_iter):
    assign_clusters()
    compute_centroids()

def initialize(k):
    # Pick k rows of the data as the initial centroids (LIMIT k)
    q = """
        SELECT index AS cluster_id, apps, ..., perc_alumni, expend, grad_rate
        FROM km_data k
        LIMIT"""
    km_clusters_sdf = spark.sql(q + ' ' + str(k))

A. Assigning Clusters
Three parts, because SparkSQL is a bit limited:

(1) Compute the data ←→ cluster distance as a sum of squares (we'll ignore the sqrt). Output: (cluster_id, data_id, dist)

# compute the distance from every data point to every centroid
q = """
    SELECT c.cluster_id, k.index,
           (pow(c.apps - k.apps, 2) + pow(c.accept - k.accept, 2) +
            pow(c.expend - k.expend, 2) + ...) AS dist
    FROM km_cluster_centroids c CROSS JOIN km_data k
"""
km_data_to_cluster_dist = spark.sql(q)

(2) Find the shortest distance for each data item (for each data_id, take the minimum dist). Output: (data_id, min_dist)

q = """ SELECT index, min(dist) as dist FROM km_data_to_cluster_dist GROUP BY index """ km_data_to_cluster_best_dist = spark.sql(q)

(3) For each data item, find the cluster ID whose distance is the shortest. Output: (cluster_id, data_id)

q = """ SELECT kd.index, kd.cluster_id FROM km_data_to_cluster_dist kd JOIN km_data_to_cluster_best_dist kb ON kd.index = kb.index WHERE kd.dist = kb.dist """ km_data_to_cluster = spark.sql(q)

B. Recomputing Centroids

q = """ SELECT cluster_id, AVG(apps) AS apps, AVG(accept) AS accept, AVG(enroll) AS enroll, AVG(top10perc) AS top10perc, AVG(top25perc) AS top25perc, ..., AVG(p_undergrad) AS p_undergrad, ..., AVG(room_board) AS room_board, AVG(books) AS books, ..., AVG(grad_rate) AS grad_rate FROM km_data_to_cluster kc JOIN km_data kd ON kc.index = kd.index GROUP BY kc.cluster_id """ km_clusters_sdf = spark.sql(q)

MLlib Version
The input DataFrame for MLlib needs two columns: index and features (a vector).

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Convert the individual feature columns into a single "features" column that is a vector
vecAssembler = VectorAssembler(inputCols=['apps', 'accept', ..., 'grad_rate'], outputCol='features')
df_kmeans = vecAssembler.transform(km_data_sdf).select('index', 'features')

# Train a k-means model
kmeans = KMeans().setK(2).setSeed(5)
model = kmeans.fit(df_kmeans)
predictions = model.transform(df_kmeans)

Advanced k-Means Clustering

How To Pick k?
Try different values of k and see how well the clusters "fit" the data. Measure the fit using distortion (sum of squared errors).

The Elbow Method
Plot the sum of squared distances (Euclidean distances) between each point and its cluster centroid for different values of k. The "within-cluster" SSE is also called inertia.

import matplotlib.pyplot as plt

distortions = []
max_k = 10
for i in range(1, max_k + 1):
    km = KMeans(n_clusters=i, init='random', n_init=1, max_iter=300, random_state=0)
    km.fit(X)
    # The distortion (within-cluster SSE)
    distortions.append(km.inertia_)
plt.plot(range(1, max_k + 1), distortions, marker='o')

Pick the k where the elbow tapers off; here, k = 7.

Can we make k-means converge faster/better?
Poor initial centroids can result in bad clustering or slow convergence. Two initialization approaches help:
(1) Repeated random initialization ("random restart")
(2) k-means++

Repeated Random Initialization
Run the k-means algorithm multiple times on the dataset and choose the model that produces the minimum distortion (SSE):

for i in range(20):
    randomly pick cluster centroids from the points X
    run k-means
    compute the distortion function (SSE)
pick the clustering with minimum SSE
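In scikit-learn this is what the n_init parameter does: it repeats the initialization n_init times and keeps the run with the lowest SSE (inertia). A minimal sketch (n_clusters=3 is just an example value):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, init='random', n_init=20, max_iter=300, random_state=0)
km.fit(X)
print(km.inertia_)   # distortion (SSE) of the best of the 20 runs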

K-means++ Initialization

Basically, choose each next centroid far away from the previously chosen ones, i.e., place the initial centroids far away from one another:
1. Initialize an empty set M (for storing the selected centroids); randomly select the first centroid from the input samples and assign it to M
2. For each sample x that is not in M, find the distance d(x) to the closest centroid in M
3. Choose one new data point at random as a new centroid, using a probability distribution weighted by d(x)² (points far from the existing centroids are more likely to be picked)
4. Repeat (2) and (3) until k centroids have been chosen
Then run classic k-means.
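A minimal numpy sketch of this initialization (the d² weighting is the standard k-means++ choice; scikit-learn's KMeans uses init='k-means++' by default):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Start M with one randomly chosen centroid
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # 2. Squared distance from each point to its closest chosen centroid
        d2 = np.array([min(np.sum((x - c) ** 2) for c in centroids) for x in X])
        # 3. Pick the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    # 4. Stop once k centroids are chosen, then run classic k-means with them
    return np.array(centroids)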

Summary
Each cluster is represented by a "prototype". As we saw in Euclidean space, that's a centroid; the prototype can also be a medoid, an actual data point, as in k-medoids. k-means is one of the most scalable methods for clustering.

Limitations
- Assumes spherically shaped clusters
- Sensitive to outliers
- Many other clustering methods exist

Hierarchical Clustering
Start with single-item clusters, then build successively bigger and bigger clusters. Or: start with one cluster, break it into the most logical sub-clusters, and repeat.

(1) Agglomerative (AHC/HAC), bottom-up: Start with each sample as an individual cluster; merge the closest pairs of clusters until only one cluster remains.
(2) Divisive, top-down: Start with one cluster that contains all samples; split the cluster into smaller clusters until each cluster contains one sample.

Single Linkage: Compute the distance between the most similar members for each pair of clusters, then merge the two clusters with the smallest min-distance.
Complete Linkage: Compute the distance between the most dissimilar members for each pair of clusters, then merge the two clusters with the smallest max-distance.
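In symbols (notation added here for clarity), for clusters A and B with pointwise distance dist(a, b):

    d_single(A, B)   = min over a ∈ A, b ∈ B of dist(a, b)
    d_complete(A, B) = max over a ∈ A, b ∈ B of dist(a, b)

At each merge step, pick the pair of clusters with the smallest d_single (single linkage) or the smallest d_complete (complete linkage).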

Pre-compute the pairwise distances into a distance matrix.

Agglomerative Clustering Steps:
1. Compute the distance matrix dist between all pairs of points

# pdist computes pairwise distances
# squareform converts a condensed 1D distance vector to a square-form distance matrix
from scipy.spatial.distance import pdist, squareform

row_dist_df = pd.DataFrame(
    squareform(pdist(revised_df, metric='euclidean')),
    columns=revised_df.index.to_list(),
    index=revised_df.index.to_list())

2. Repeat until a single cluster remains:
   1. Iterate over pairs of clusters A, B and compute their distance: look at all pairwise distances dist[a, b] between members a ∈ A and b ∈ B
   2. Merge the pair of clusters with the minimum distance between their most distant members (complete linkage)
   3. Update the distance matrix

from scipy.cluster import hierarchy
from scipy.cluster.hierarchy import linkage

# create the clusters
row_clusters = linkage(row_dist_df, method='complete')

# plot them in a dendrogram
plt.figure()
dn = hierarchy.dendrogram(row_clusters, labels=revised_df.index.to_list())

Hierarchical Clustering in Spark
MLlib contains divisive, not agglomerative, algorithms for clustering.
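A minimal sketch using Spark's bisecting k-means, a divisive hierarchical method, assuming the df_kmeans DataFrame (index + 'features' vector column) built in the MLlib section above:

from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans().setK(4).setSeed(1)   # k = 4 is just an example value
model = bkm.fit(df_kmeans)
predictions = model.transform(df_kmeans)     # adds a 'prediction' column with cluster IDs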

Summary
Hierarchical clustering is an alternative to k-means clustering.

Nice things:
- Easier to visualize and interpret as a taxonomy ("dendrogram" plots!)
- We don't need to specify the number of clusters upfront

Limitations:
- Doesn't scale well to big problems
- Not obvious how far up the hierarchy to go, e.g., when do we have the right number of clusters?

