Lecture Notes HWS2021

Title: Lecture Notes HWS2021
Author: Luis Garza
Course: Data Mining
Institution: Universität Mannheim

Summary

Summary of all Data Mining I lectures, with all definitions, concepts, and algorithms learned during the semester...


Description

Data Mining (Prof. Paulheim) Summary by Sascha Ulbrich

Contents

1. Overview
2. Clustering (descriptive)
   K-Means Clustering
   DBSCAN
   Hierarchical Clustering
   Proximity Measures
3. Classification (predictive)
   K Nearest Neighbors
   Nearest Centroids
   Bayes Classifier
   Decision Tree Classifier
      Decision Tree Induction
      Model Evaluation
   Rule-based Classifiers
   Alternative Classification Models
   Parameter Tuning
4. Association Rule Mining (descriptive)
   Apriori
   FP-growth
   Measures
5. Text Mining (both descriptive and predictive)
   Text Preprocessing
   Feature Creation
   Feature Selection
   Pattern Discovery (Data Mining)

1. Overview

• Data Mining is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• Descriptive (unsupervised) methods
  o find patterns in the data
  o The set of classes/clusters is not known beforehand
  o We want to explore the data to find some intrinsic structure in it
  o e.g., which products are often bought together?
• Predictive (supervised) methods
  o predict unknown or future values of a variable
    ▪ given observations (e.g., from the past)
  o The set of classes is known beforehand
  o Class attributes are usually provided by human annotators
  o Patterns are used for prediction of the target attribute for new data
  o e.g., will a person click a banner?
    ▪ given his/her browsing history
• Selection and Exploration
  o Selection
    ▪ What data is available?
    ▪ What do I know about the provenance of this data?
    ▪ What do I know about the quality of the data?
  o Exploration
    ▪ Get an initial understanding of the data
    ▪ Calculate basic summarization statistics
    ▪ Visualize the data
    ▪ Identify data problems such as outliers, missing values, duplicate records
  o Visual Data Mining
    ▪ for example as maps, e.g., a map showing the most popular photo locations in the world, generated from Panoramio logs
• Data Preprocessing
  o Transform the data into a representation that is suitable for the chosen data mining methods
    ▪ number of dimensions
    ▪ scales of attributes (nominal, ordinal, numeric)
    ▪ amount of data (determines hardware requirements)
  o Methods (see the sketch below)
    ▪ aggregation, sampling
    ▪ dimensionality reduction / feature subset selection
    ▪ attribute transformation / text to term vector
    ▪ discretization and binarization
  o Good data preparation is key to producing valid and reliable models
  o Data preparation is estimated to take 70-80% of the time and effort of a data mining project!
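To make the preprocessing methods above concrete, here is a minimal sketch on a made-up toy table; pandas/scikit-learn are only one possible toolset, and all column names and values are invented for illustration:

    # Minimal preprocessing sketch on a made-up toy table.
    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

    df = pd.DataFrame({
        "age":    [23, 45, 31, 52, 38],
        "income": [1800, 5200, 3100, 6100, 4000],
        "gender": ["m", "f", "f", "m", "f"],
    })

    # Sampling: work on a random subset to reduce the amount of data (not used further here).
    df_sample = df.sample(n=3, random_state=42)

    # Attribute transformation: rescale the numeric attributes to [0, 1] (normalization).
    df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

    # Binarization: turn the nominal attribute into 0/1 indicator columns.
    df = pd.get_dummies(df, columns=["gender"])

    # Discretization: map a numeric attribute to three ordinal bins.
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
    df["age_bin"] = disc.fit_transform(df[["age"]]).ravel()

    print(df)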



• Data Mining
  o Input: preprocessed data
  o Output: model / patterns
    ▪ 1. Apply the data mining method
    ▪ 2. Evaluate the resulting model / patterns
    ▪ 3. Iterate:
      • Experiment with different parameter settings
      • Experiment with different alternative methods
      • Improve preprocessing and feature generation
      • Combine different methods
• Deployment
  o Use the model in a business context

2. Clustering (descriptive)

• Goals
  o Finding groups of objects such that
    ▪ the objects in a group will be similar to one another (intra-cluster distances are minimized)
    ▪ and different from the objects in other groups (inter-cluster distances are maximized)
  o Get a better understanding of the data.
• Types
  o Partitional Clustering
    ▪ A division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  o Hierarchical Clustering
    ▪ A set of nested clusters organized as a hierarchical tree
• Aspects
  o Clustering algorithm
    ▪ Partitional algorithms
    ▪ Hierarchical algorithms
    ▪ Density-based algorithms
  o Proximity (similarity, or dissimilarity) measure
    ▪ Euclidean distance
    ▪ Cosine similarity
    ▪ Domain-specific similarity measures
  o Clustering quality
    ▪ Intra-cluster distances -> minimized
    ▪ Inter-cluster distances -> maximized
• Applications
  o Market research (groups of customers)
  o Product grouping (offers of the same/similar products)
  o Social network analysis (identifying communities of people)
  o Grouping search engine results (related pages)
  o Image recognition (portions of an image -> object)

K-Means Clustering

• Partitional clustering approach
  o Each cluster is associated with a centroid (center point)
  o Each point is assigned to the cluster with the closest centroid
• Number of clusters, k, must be specified manually
• Algorithm (see the sketch below)
  o 1. Select k points as initial centroids
  o 2. Assign each point to its closest centroid
  o 3. Recompute the centroid of each cluster as the mean of its points
  o 4. Repeat steps 2 and 3 until the centroids do not change
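A minimal NumPy sketch of the loop described above (the two-blob toy data and the choice k = 2 are invented for illustration; this is not a reference implementation from the lecture):

    # Minimal k-means sketch in NumPy (illustrative only).
    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Select k random data points as initial centroids (seeds).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Assign each point to the closest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute each centroid as the mean of its assigned points
            #    (empty clusters are not handled in this sketch).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Stop when the centroids no longer change (one possible convergence criterion).
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    X = np.vstack([np.random.randn(50, 2) + [0, 0],
                   np.random.randn(50, 2) + [5, 5]])
    labels, centroids = kmeans(X, k=2)
    print(centroids)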





• Alternative convergence criteria
  o no (or a minimal number of) re-assignments of data points to different clusters
  o no (or minimal) change of the centroids, or
  o minimal decrease in the sum of squared errors (SSE)
  o stop after X iterations
• Evaluation
  o The most common cohesion measure is the Sum of Squared Error (SSE)
    ▪ For each point, the error is the distance to the nearest centroid
    ▪ To get the SSE, we square these errors and sum them:
      SSE = Σ_j Σ_{x ∈ C_j} dist(x, m_j)²
    ▪ C_j is the j-th cluster
    ▪ m_j is the centroid of cluster C_j (the mean vector of all data points in C_j)
    ▪ dist(x, m_j) is the distance between data point x and centroid m_j
  o Given several clusterings, we should prefer the one with the smallest SSE
• Weaknesses
  o Initial seeds
    ▪ Results can vary significantly depending on the initial choice of seeds (number and position)
    ▪ Approaches to increase the chance of finding good clusters (see the sketch below):
      • restart a number of times with different random seeds and choose the resulting clustering with the smallest SSE
      • run k-means with different values of k
        o Note that the SSE of runs with different values of k cannot be compared to each other!
        o Think: what happens for k → n (number of examples)?
      • X-means: start with a small k, then split large clusters and see if the result improves
  o Outlier handling
    ▪ Possible remedy: remove data points that are far away from the centroids
      • to be safe: monitor these possible outliers over a few iterations and only then decide to remove them
    ▪ Other remedy: random sampling
      • choose a small subset of the data points
      • the chance of selecting an outlier is very small if the data set is large enough
      • after determining the centroids based on the sample, assign the remaining data points to them
      • also a method for improving runtime performance!
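The "restart with different seeds and keep the smallest SSE" remedy could look like this in practice; scikit-learn's KMeans is used here only as one possible implementation, and its inertia_ attribute corresponds to the SSE (the data is made up):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 1], [3, 7])])

    # n_init=10 performs 10 restarts with different random seeds and keeps the best (lowest SSE) run.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("SSE of best run:", km.inertia_)

    # Remember: SSE values are only comparable for the same k -- for k -> n the SSE drops towards 0.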





• Variation: K-Medoids
  o K-Medoids is a K-Means variation that uses the medoid of each cluster instead of the mean
  o Medoids are the most central existing data points in each cluster
  o K-Medoids is more robust against outliers, as the medoid is not affected by extreme values
• Summary
  o Simple and efficient, but k must be chosen manually, and the result depends on the initial seeds and is sensitive to outliers

DBSCAN

• DBSCAN is a density-based algorithm
  o Density = number of points within a specified radius (Eps)
• Divides data points into three classes:
  o A point is a core point if it has more than a specified number of points (MinPts) within Eps
    ▪ These are points that are in the interior of a cluster
  o A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  o A noise point is any point that is neither a core point nor a border point
• Algorithm
  o 1. Label all points as core, border, or noise points
  o 2. Eliminate the noise points
  o 3. Connect all core points that are within Eps of each other
  o 4. Make each group of connected core points into a separate cluster
  o 5. Assign each border point to one of the clusters of its associated core points
• Determining Eps and MinPts (see the sketch below)
  o Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance
  o Noise points have their k-th nearest neighbor at a farther distance
  o So, plot the sorted distance of every point to its k-th nearest neighbor; the "knee" of this curve suggests a good Eps
• Strengths
  o Resistant to noise
  o Can handle clusters of different shapes and sizes
• Weaknesses
  o Varying densities
  o High-dimensional data
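A sketch of how DBSCAN and the k-distance idea for choosing Eps might be used in practice; scikit-learn is one possible implementation, and the data set and parameter choices are invented:

    # DBSCAN sketch: pick Eps from the sorted k-distance curve, then cluster.
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    X = np.vstack([np.random.randn(100, 2) * 0.3 + c for c in ([0, 0], [3, 3])])

    k = 4  # plays the role of MinPts (the point itself is counted, as scikit-learn does)
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    k_dist = np.sort(dists[:, -1])   # sorted distance of every point to its k-th neighbor
    eps = np.percentile(k_dist, 90)  # rough stand-in for reading the "knee" off a plot

    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
    print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))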





Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram
  o A tree-like diagram that records the sequences of merges or splits.
  o The y-axis displays the distance at which two clusters were merged.
• Algorithm (agglomerative)
  o 1. Compute the proximity matrix and let each data point be its own cluster
  o 2. Merge the two closest clusters
  o 3. Update the proximity matrix
  o 4. Repeat steps 2 and 3 until only a single cluster remains
• Defining inter-cluster similarity (see the sketch below)
  o Single Link (MIN)
    ▪ Similarity of two clusters is based on the two most similar (closest) points in the different clusters
    ▪ Determined by one pair of points, i.e., by one link in the proximity graph
    ▪ Pro: can handle non-elliptic shapes
    ▪ Con: sensitive to outliers
  o Complete Link (MAX)
    ▪ Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
    ▪ Determined by all pairs of points in the two clusters
    ▪ Pro: less sensitive to noise and outliers
    ▪ Con: biased towards globular clusters; tends to break large clusters
  o Group Average
    ▪ Proximity of two clusters is the average of the pair-wise proximities between points in the two clusters
    ▪ Need to use average connectivity for scalability, since total proximity favors large clusters
    ▪ Compromise between Single and Complete Link
    ▪ Strengths: less susceptible to noise and outliers
    ▪ Limitations: biased towards globular clusters
  o Distance Between Centroids
    ▪ Proximity of two clusters is the distance between their centroids
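The agglomerative procedure and the different linkage strategies can be tried out, for example, with SciPy; the toy data below is invented and the code is only an illustrative sketch:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    X = np.vstack([np.random.randn(20, 2) * 0.4 + c for c in ([0, 0], [4, 0], [2, 3])])

    for method in ["single", "complete", "average", "centroid"]:
        Z = linkage(X, method=method)                      # build the merge tree (agglomerative)
        labels = fcluster(Z, t=3, criterion="maxclust")    # "cut" the dendrogram into 3 clusters
        print(method, "cluster sizes:", np.bincount(labels)[1:])

    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree; its y-axis is the merge distance.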

• Strengths
  o We do not have to assume any particular number of clusters
    ▪ Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
  o May be used to look for meaningful taxonomies
    ▪ taxonomies in the life sciences
    ▪ taxonomy of customer groups
• Problems and Limitations
  o Greedy algorithm: a decision once taken (i.e., merging two clusters) cannot be undone
  o Different variants have problems with one or more of the following:
    ▪ sensitivity to noise and outliers
    ▪ difficulty handling different-sized clusters and convex shapes
    ▪ breaking large clusters
  o High space and time complexity
    ▪ O(N²) space, since it uses the proximity matrix (N: number of data points)
    ▪ O(N³) time in many cases
      • N steps, in each of which the N² similarity matrix must be processed
      • Complexity can be reduced to O(N² log(N)) time for some approaches

Proximity Measures

• Similarity
  o Numerical measure of how alike two data objects are (higher: more alike)
  o Often falls in the range [0,1]
• Dissimilarity (or distance)
  o Numerical measure of how different two data objects are (higher: less alike)
  o Minimum dissimilarity is often 0
  o Upper limit varies
• A wide range of different measures is used, depending on the requirements of the application
  o All those measures cover the proximity of single attribute values
  o But we usually have data points with many attributes
    ▪ e.g., age, height, weight, sex...
  o -> proximity measures for data points:
    ▪ Euclidean distance: dist(p, q) = sqrt( Σ_k (p_k − q_k)² ), where p_k and q_k are the k-th attribute values of data points p and q
• Pitfalls:
  o We are easily comparing apples and oranges
    ▪ changing the units of measurement changes the clustering result!
  o Recommendation:
    ▪ use normalization before clustering
    ▪ generally: for all data mining algorithms involving distances



Similarity of Binary Attributes

• Common situation: objects p and q have only binary attributes
  o e.g., customer bought an item (yes/no)
• Compute similarities using the following quantities
  o M01 = the number of attributes where p was 0 and q was 1
  o M10 = the number of attributes where p was 1 and q was 0
  o M00 = the number of attributes where p was 0 and q was 0
  o M11 = the number of attributes where p was 1 and q was 1
• Symmetric binary attributes (both values equally important) -> Simple Matching Coefficient
  o SMC = (M11 + M00) / (M01 + M10 + M11 + M00)
• Asymmetric binary attributes (only the 1s are informative) -> Jaccard coefficient
  o J = M11 / (M01 + M10 + M11)
• SMC vs Jaccard
  o SMC counts both the 1-1 and the 0-0 matches, Jaccard ignores the 0-0 matches
  o For sparse asymmetric data (e.g., market baskets), SMC makes almost all pairs look similar, while Jaccard does not
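To make the measures concrete, here is a small sketch computing the Euclidean distance for numeric vectors and SMC/Jaccard for binary vectors (the example vectors are made up):

    # Proximity measures on made-up example vectors.
    import numpy as np

    # Euclidean distance between two numeric data points.
    p = np.array([1.0, 3.5, 2.0])
    q = np.array([2.0, 1.5, 0.0])
    euclid = np.sqrt(np.sum((p - q) ** 2))

    # Binary attribute vectors (e.g., "customer bought item k": 1 = yes, 0 = no).
    a = np.array([1, 0, 0, 1, 0, 0, 0, 0])
    b = np.array([1, 1, 0, 0, 0, 0, 0, 0])
    m11 = np.sum((a == 1) & (b == 1))
    m00 = np.sum((a == 0) & (b == 0))
    m10 = np.sum((a == 1) & (b == 0))
    m01 = np.sum((a == 0) & (b == 1))

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # counts 0-0 matches -> symmetric attributes
    jaccard = m11 / (m01 + m10 + m11)             # ignores 0-0 matches -> asymmetric attributes

    print(euclid, smc, jaccard)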

3. Classification (predictive)

• Given
  o a set of labeled records, consisting of
    ▪ data fields (a.k.a. attributes or features)
    ▪ a class label (e.g., true/false)
• Generate a function f(r)
  o input: a record
  o output: a class label
  o which can be used for classifying previously unseen records
• Variants:
  o single-class problems (e.g., only true/false)
  o multi-class problems
  o multi-label problems (more than one class per record, not covered in this lecture)
  o hierarchical multi-class/label problems (with a class hierarchy, e.g., product categories)
• Goal: previously unseen records should be assigned a class as accurately as possible
  o A test set is used to validate the accuracy of the model
  o The training set may be split into training and validation data (see the sketch below)
• k-NN, Nearest Centroids, and Naïve Bayes are all "lazy" methods
  o They do not build an explicit model!
  o "Learning" is only performed on demand for unseen records
• Eager learning:
  o Goals:
    ▪ classify unseen instances
    ▪ learn a model
  o Model
    ▪ explains how to classify unseen instances
    ▪ sometimes: interpretable by humans
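A minimal sketch of the train/validation/test idea with scikit-learn's train_test_split; the data set and the split ratios are made up:

    # Split labeled records into training, validation, and test sets (made-up data and ratios).
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.randn(1000, 5)             # data fields (attributes/features)
    y = np.random.randint(0, 2, size=1000)   # class label (e.g., true/false)

    # Hold out a test set for the final accuracy estimate ...
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # ... and split the remaining training data again to get a validation set, e.g. for tuning.
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

    print(len(X_train), len(X_val), len(X_test))  # 600 / 200 / 200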

K Nearest Neighbors

• Requires a notion of similarity (i.e., what is "near"?)
• Review: similarity measures for clustering
  o similarity of individual data values
  o similarity of data points
• Think about scales and normalization!
• Requires
  o a set of stored records
  o a distance metric to compute the distance between records
  o the value of k (number of nearest neighbors)
• Algorithm (see the sketch below)
  o Compute the distance to each training record
  o Identify the k nearest neighbors
  o Use the class labels of the nearest neighbors to determine the class label of the unknown record by
    ▪ taking a majority vote, or
    ▪ weighting the votes according to distance
• Con
  o Choosing a good value for k
    ▪ If k is too small, the result is sensitive to noise points
    ▪ If k is too large, the neighborhood may include points from other classes
    ▪ Rule of thumb: test k values between 1 and 10
  o Slow, as the training data needs to be searched
  o Assumes all attributes are equally important
    ▪ Remedy: attribute selection or attribute weights
• Pro
  o Often very accurate
  o Can handle decision boundaries which are not parallel to the axes
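One possible way to put the k-NN recipe into code, here with scikit-learn (the data, the choice k = 5, and the use of distance weighting are only illustrative assumptions):

    # k-NN sketch: normalize, fit, and use distance-weighted voting.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import MinMaxScaler

    X = np.random.randn(200, 3)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    X = MinMaxScaler().fit_transform(X)   # distances are scale-sensitive -> normalize first

    # weights="uniform" = majority vote, weights="distance" = weight votes by inverse distance.
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
    print(knn.predict(X[:3]))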

Nearest Centroids

• a.k.a. Rocchio classifier
• Training: compute the centroid for each class
  o center of all points of that class
  o like: the centroid of a cluster
• Classification: assign each data point to the nearest centroid
• vs. k-NN
  o Some data points are mislabeled: k-NN loses performance, NC is stable
  o One class is significantly smaller than the other: k-NN loses performance, NC is stable
  o Outliers are contained in the dataset: k-NN is stable, NC loses performance
  o k-NN
    ▪ slow at classification time (linear in the number of data points)
    ▪ requires much memory (stores all data points)
    ▪ robust to outliers
  o Nearest Centroid
    ▪ fast at classification time (linear in the number of classes)
    ▪ requires only little memory (stores only the centroids)
    ▪ robust to label noise
    ▪ robust to class imbalance
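For comparison, a nearest-centroid (Rocchio) sketch with scikit-learn's NearestCentroid; the data is again invented:

    # Nearest centroid sketch: one centroid per class, assign to the closest one.
    import numpy as np
    from sklearn.neighbors import NearestCentroid

    X = np.random.randn(200, 3)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    nc = NearestCentroid().fit(X, y)   # training = computing one centroid per class
    print(nc.centroids_)               # the learned class centroids
    print(nc.predict(X[:3]))           # classification = assign to the nearest centroid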

Bayes Classifier

• Based on Bayes' theorem
  o an important theorem in probability theory:
    P(C|A) = P(A|C) · P(C) / P(A)
  o Computes one conditional probability, P(C|A), out of another, P(A|C),
  o given that the base probabilities P(A) and P(C) are known
• Useful in situations where P(C|A) is unknown
  o while P(A|C), P(A) and P(C) are known or easy to determine/estimate
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  o Goal is to predict the class C
  o Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
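As a tiny worked example with made-up numbers (1% of all users click the banner, and a particular browsing history occurs for 10% of clickers and for 2% of all users):

    # Bayes' theorem with made-up numbers: P(C|A) = P(A|C) * P(C) / P(A)
    p_c = 0.01          # P(C): 1% of all users click the banner
    p_a_given_c = 0.10  # P(A|C): 10% of clickers have this browsing history
    p_a = 0.02          # P(A): 2% of all users have this browsing history
    print(p_a_given_c * p_c / p_a)  # P(C|A) = 0.05 -> five times the prior click probability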

• Estimate probabilities from the data
  o P(C) = fraction of training records belonging to class C
  o P(Ai = a | C) = fraction of records of class C that have value a for attribute Ai
• Naïve Bayes Classifier
  o assumes that the attributes are conditionally independent given the class:
    P(A1, A2, …, An | C) = P(A1|C) · P(A2|C) · … · P(An|C)
  o classify a record by choosing the class C that maximizes
    P(C) · P(A1|C) · P(A2|C) · … · P(An|C)
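A minimal sketch of Naïve Bayes on a made-up nominal data set; the add-one (Laplace) smoothing used here is one common way of avoiding the zero-frequency problem discussed below, not necessarily how the lecture handles it:

    # Naive Bayes sketch for nominal attributes: probabilities are estimated by counting,
    # and add-one (Laplace) smoothing avoids zero frequencies. Data is made up.
    from collections import Counter, defaultdict

    data = [({"outlook": "sunny", "windy": "no"},  "play"),
            ({"outlook": "sunny", "windy": "yes"}, "stay"),
            ({"outlook": "rainy", "windy": "yes"}, "stay"),
            ({"outlook": "rainy", "windy": "no"},  "play"),
            ({"outlook": "sunny", "windy": "no"},  "play")]

    classes = Counter(label for _, label in data)   # class counts -> P(C)
    counts = defaultdict(Counter)                   # (class, attribute) -> value counts
    values = defaultdict(set)                       # attribute -> set of observed values
    for record, label in data:
        for attr, val in record.items():
            counts[(label, attr)][val] += 1
            values[attr].add(val)

    def predict(record):
        best, best_p = None, -1.0
        for c, n_c in classes.items():
            p = n_c / len(data)                     # P(C)
            for attr, val in record.items():        # naive independence assumption
                p *= (counts[(c, attr)][val] + 1) / (n_c + len(values[attr]))  # Laplace smoothing
            if p > best_p:
                best, best_p = c, p
        return best

    print(predict({"outlook": "rainy", "windy": "no"}))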







• Handling missing values
  o Missing values may occur in training and classification examples
  o Training: the instance is not included in the frequency count for the attribute value-class combination
  o Classification: the attribute is omitted from the calculation
• Zero-frequency problem
  o If one of the conditional probabilities is zero, the entire expression becomes zero
  o And it is not unlikely that a particular attribute value-class combination has simply not been observed yet
• Summary
  o Robust to isolated noise points
    ▪ they have a small impact on the probabilities
  o Handles missing values by ignoring the instance during the probability estimate calculations
  o Robust to irrelevant attributes
  o The independence assumption may not hold for some attributes
    ▪ Use other techniques such as Bayesian Belief Networks (BBN)
  o Naïve Bayes works surprisingly well
    ▪ even if the independence assumption is clearly violated
    ▪ Classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
  o However: adding too many redundant attributes will cause problems
    ▪ Solution: select an attribute subset, as Naïve Bayes often works as well or better with just a fraction of all attributes
  o Technical advantages:
    ▪ Learning a Naïve Bayes classifier is computationally cheap, as the probabilities can be estimated in one pass over the training data
    ▪ Storing the probabilities does not require a lot of memory
  o k-NN vs. Naïve Bayes
    ▪ Computation
      • Naïve Bayes is often faster
    ▪ Naïve Bayes uses all data points
      • less sensitive to outliers and l...

