Ml 06 B1 nearest neighbor handout 2up PDF

Title	Ml 06 B1 nearest neighbor handout 2up
Author	maryam bahrami
Course	Machine Learning
Institution	Universität Hildesheim
Pages	24
File Size	862.4 KB
File Type	PDF
Total Downloads	81
Total Views	118

Preview

CLICK TO PREVIEW PDF

Summary

KNN, Nearest Neighbor Algorithm. Distance Measures...

Description

Machine Learning

Syllabus Fri. 27.10. Fri. Fri. Fri. Fri. Fri. Fri. Fri. Fri. Fri.

3.11. 10.11. 17.11. 24.11. 1.12. 8.12. 15.12. 12.1. 19.1.

Fri. 26.1. Fri. 2.2. Fri. 9.2.

(1)

0. Introduction

(2) (3) (4) (5)

A. Supervised Learning: Linear Models & Fundamentals A.1 Linear Regression A.2 Linear Classification A.3 Regularization A.4 High-dimensional Data

(6) (7) (8) (9) (10)

B. Supervised Learning: Nonlinear Models B.1 Nearest-Neighbor Models B.2 Neural Networks B.3 Decision Trees B.4 Support Vector Machines B.5 A First Look at Bayesian and Markov Networks

(11) (12) (13)

C. Unsupervised Learning C.1 Clustering C.2 Dimensionality Reduction C.3 Frequent Pattern Mining

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33 Machine Learning

Outline

1. Distance Measures

2. K -Nearest Neighbor Models

3. Scalable Nearest Neighbor

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33

Machine Learning

Outline

1. Distance Measures

2. K -Nearest Neighbor Models

3. Scalable Nearest Neighbor

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33 Machine Learning

Motivation So far, regression and classification methods covered in the lecture can be used for ◮ numerical variables, ◮

binary variables (re-interpreted as numerical), and

nominal variables (coded as set of binary indicator variables). often called scalar variables. ◮

Often one is also interested in more complex variables such as ◮ set-valued variables, ◮ ◮

sequence-valued variables (e.g., strings), ...

often called structured variables or complex variables. Note: A complex variable in this sense has nothing to do with complex numbers. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33

Machine Learning

Motivation There are two kinds of approaches to deal with complex variables: I. feature extraction 1. derive binary or numerical variables, 2. then use standard methods on the feature vectors.

II. kernel methods 1. establish a distance measure between two values, 2. then use methods that use only distances between objects (but no feature vectors).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 2 / 33 Machine Learning

Distance measures Let d be a distance measure (also called metric) on a set X , i.e., d : X × X → R+ 0 with 1. d is positiv definite: d(x, y ) ≥ 0 and d(x, y ) = 0 ⇔ x = y 2. d is symmetric: d(x, y ) = d(y , x ) 3. d is subadditive: d(x, z) ≤ d(x, y ) + d(y , z ) (triangle inequality) (for all x, y , z ∈ X .) Example: Euclidean metric on X := Rn : n 1

d(x, y ) := ( (xi − yi )2 ) 2 X i=1

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 3 / 33

Machine Learning

Minkowski Metric / Lp metric

Minkowski Metric / Lp metric on X := Rn : with p ∈ R, p ≥ 1.

n X 1 d(x, y ) := ( p p |xi − yi | ) i=1

p = 1 (taxicab distance; Manhattan distance): d(x, y ) :=

n X i=1

|xi − yi |

p = 2 (euclidean distance): n X 1 d(x, y ) := ( (xi − yi )2 ) 2 i=1

p = ∞ (maximum distance; Chebyshev distance): n

d(x, y ) := max |xi − yi | i=1

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 4 / 33 Machine Learning

Minkowski Metric / Lp metric / Example Example: 

 1 x :=  3  , 4

 2 y :=  4  1 

dL1 (x, y ) =|1 − 2| + |3 − 4| + |4 − 1| = 1 + 1 + 3 = 5 dL2 (x, y ) =

q

(1 − 2)2 + (3 − 4)2 + (4 − 1)2 =

√ √ 1 + 1 + 9 = 11 ≈ 3.32

dL∞ (x, y ) = max{|1 − 2|, |3 − 4|, |4 − 1|} = max{1, 1, 3} = 3

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 5 / 33

Machine Learning

Similarity measures Instead of a distance measure sometimes similarity measures are used, i.e., sim : X × X → R0+ with ◮

sim is symmetric: sim(x, y ) = sim(y , x).

Some similarity measures have stronger properties: ◮ ◮

sim is discerning: sim(x, y ) ≤ 1 and sim(x, y ) = 1 ⇔ x = y sim(x, z) ≥ sim(x, y ) + sim(y , z) − 1.

Some similarity measures have values in [−1, 1] or even R where negative values denote “dissimilarity”. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 6 / 33 Machine Learning

Distance vs. Similarity measures

A discerning similarity measure can be turned into a semi-metric (pos. def. & symmetric, but not necessarily subadditive) via d(x, y ) := 1 − sim(x, y ) In the same way, a metric can be turned into a discerning similarity measure (with values possibly in ] − ∞, 1]).

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 7 / 33

Machine Learning

Cosine Similarity

The angle between two vectors in RN can be used as distance measure hx, y i ) d(x , y ) := angle(x , y ) := arccos( ||x ||2 ||y ||2 To avoid the arccos, often the cosine of the angle is used as similarity measure (cosine similarity): sim(x, y ) := cos angle(x, y ) :=

hx, y i ||x ||2 ||y ||2

Example: 

 1 x :=  3  , 4



 2 y :=  4  1

18 1·2+3·4+4·1 √ = √ √ ≈ 0.77 sim(x, y ) = √ 1 + 9 + 16 4 + 16 + 1 26 21 Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

b

8 / 33

Machine Learning

Distances for Nominal Variables 1. Binary variables: ◮ there is only one reasonable distance measure: d(x, y ) := 1 − I(x = y ) ◮

with I(x = y ) :=



1 if x = y 0 otherwise

This coincides with ◮

L∞ , 21 L1 and

√1 L2 2

distance

for the indicator/dummy variables. 2. Nominal variables (with more than two possible values): ◮ The same distance measure is useful. 3. Hierarchical variables (i.e., a nominal variable with levels arranged in a hierarchy) ◮ there are more advanced distance measures (not covered here). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 9 / 33

Machine Learning

Distances for Set-valued Variables For set-valued variables (which values are subsets of a set A) the Hamming distance often is used: d(x , y ) := |(x \ y ) ∪ (y \ x)| = |{a ∈ A | I(a ∈ x ) 6= I(a ∈ y )}| (= the number of elements contained in only one of the two sets). Example: d({a, e, p, l}, {a, b, n}) = 5,

d({a, e , p, l}, {a, e , g , n, o, r }) = 6

Also often used is the similarity measure Jaccard coefficient: sim(x, y ) := Example: sim({a, e, p, l}, {a, b, n}) =

1 , 6

|x ∩ y | |x ∪ y |

sim({a, e, p, l}, {a, e, g , n, o, r }) =

2 8

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 10 / 33 Machine Learning

Distances for Strings / Sequences edit distance / Levenshtein distance: d(x, y ) := minimal number of single character deletions, insertions or substitutions to transform x in y Examples: d(man, men) = d(house, spouse) = d(order, express order) =

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 11 / 33

Machine Learning

Distances for Strings / Sequences edit distance / Levenshtein distance: d(x, y ) := minimal number of single character deletions, insertions or substitutions to transform x in y Examples: d(man, men) =1 d(house, spouse) =2 d(order, express order) =8

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 11 / 33 Machine Learning

Distances for Strings / Sequences The edit distance is computed recursively. With x1:i := (xi ′ )i ′ =1,...,i = (x1 , x2 , . . . , xi ),

i ∈N

we compute the number of operations to transform x1:i into y1:j as c(x1:i , y1:j ) := min{ c(x1:i −1 , y1:j ) + 1, // delete xi , x1:i −1 y1:j c(x1:i , y1:j−1 ) + 1, // x1:i y1:j−1 , insert yj c(x1:i −1 , y1:j−1 ) + I (xi 6= yj )} // x1:i −1 y1:j−1 , substitute yj for xi starting from c(x1:0 , y1:j ) = c(∅, y1:j ) := j c(x1:i , y1:0 ) = c(x1:i , ∅) := i

// insert y1 , . . . , yj // delete x1 , . . . , xi

Such a recursive computing scheme is called dynamic programming. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 12 / 33

Machine Learning

Distances for Strings / Sequences Example: compute d(excused, exhausted). 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 y [j]/x[i] e x c u s e d d e t s u a h x e

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33 Machine Learning

Distances for Strings / Sequences Example: compute d(excused, exhausted). d e t s u a h x e y [j]/x[i]

9 8 7 6 5 4 3 2 1 0

8 7 6 5 4 3 2 1 0 1 e

7 6 5 4 3 2 1 0 1 2 x

7 6 5 4 3 2 1 1 2 3 c

6 5 4 3 2 2 2 2 3 4 u

5 4 3 2 3 3 3 3 4 5 s

4 3 3 3 4 4 4 4 5 6 e

3 4 4 4 5 5 5 5 6 7 d

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33

Machine Learning

Distances for Strings / Sequences Example: compute d(excused, exhausted). d e t s u a h x e

9 8 7 6 5 4 3 2 1 0

y [j]/x[i]

8 7 6 5 4 3 2 1 0 1 e

7 6 5 4 3 2 1 0 1 2 x

7 6 5 4 3 2 1 1 2 3 c

6 5 4 3 2 2 2 2 3 4 u

5 4 3 2 3 3 3 3 4 5 s

4 3 3 3 4 4 4 4 5 6 e

3 4 4 4 5 5 5 5 6 7 d

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33 Machine Learning

Outline

1. Distance Measures

2. K -Nearest Neighbor Models

3. Scalable Nearest Neighbor

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 14 / 33

Machine Learning

Neighborhoods Let d be a distance measure. For a dataset D ⊆X ×Y and x ∈ X let

D = {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}

be an enumeration with increasing distance to x, i.e., d(x, xn ) ≤ d(x, xn+1 ),

n = 1, . . . , N

(ties broken arbitrarily). The first K ∈ N points of such an enumeration, i.e., CK (x) := {(x1 , y1 ), (x2 , y2 ), . . . (xK , yK )} are called a K -neighborhood of x (in D). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 14 / 33 Machine Learning

Nearest Neighbor Regression and Classification Models The K -nearest neighbor regressor 1 yˆ (x) := K

X

y′

X

I(y = y ′ )

(x ′ ,y ′ )∈CK (x)

The K -nearest neighbor classifier 1 pˆ(Y = y | x) := K

(x ′ ,y ′ )∈CK (x)

and then predict the class with maximal predicted probability yˆ (x) := arg max pˆ(Y = y | x) y ∈Y

i.e., the majority class in the neighborhood. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 15 / 33

Machine Learning

Nearest Neighbor Regression Algorithm

1 2 3 4 5 6 7

predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 16 / 33 Machine Learning

Nearest Neighbor Classification Algorithm

1 2 3 4 5 6 7 8 9

predict-knn-class(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × Y, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) allocate array pˆ of size |Y| for k := 1 : K : pˆCk := pˆCk + 1/K return pˆ

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 17 / 33

Machine Learning

Compute the argmin 1 2 3 4 5 6 7 8

argmin-k(x ∈ RN , K ∈ N) : allocate array T of size K for n = 1 : min(K , N ): insert-bottomk(T1:n , n, πx , 1) for n = K + 1 : N : if xn < xTK : insert-bottomk(T , n, πx , 0) return T

9 10 11 12 13 14

insert-bottomk(T ∈ X K , n ∈ X , π : X → R, s ∈ N) : k := find-sorted(T1:K −s , n, π) for l := K : k + 1 decreasing: Tl := Tl−1 Tk+1 := n Note: πx (n) := xn comparison by x-values. Here, X := N. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 18 / 33 Machine Learning

Compute the argmin / find (naive)

1 2 3 4 5

find-sorted-linear(x ∈ X K , z ∈ X , π : X → R) : k := K while k > 0 and π(z) < π(xk ): k := k − 1 return k

◮

requires ◮

◮

x is sorted (increasingly w.r.t. π )

returns smallest index k with π(xk ) ≤ π(z) ◮

0, if π(z) < π(x1 )

Note: Esp. for larger K it is better to use binary search. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 19 / 33

Machine Learning

Decision Boundaries For 1-nearest neighbor, the predictor space is partitioned in regions of points that are closest to a given data point: regionD (x1 ), regionD (x2 ), . . . , regionD (xN ) with regionD (x) := {x ′ ∈ X | d(x ′ , x ) ≤ d(x ′ , x ′′ )

∀(x ′′ , y ′′ ) ∈ D}

These regions often are called cells, the whole partition a Voronoi tesselation.

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 20 / 33 Machine Learning

0.0

0.2

0.4

x2

0.6

0.8

1.0

Decision Boundaries

0.0

0.2

0.4

0.6

0.8

1.0

x1

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 21 / 33

Machine Learning

Decision

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 21 / 33 Machine Learning

Outline

1. Distance Measures

2. K -Nearest Neighbor Models

3. Scalable Nearest Neighbor

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 22 / 33

Machine Learning

Complexity of K -Nearest Neighbor Classifier

The K -Nearest Neighbor classifier does not need any learning algorithm ◮ as it just stores all the training examples. On the other hand, predicting using a K -nearest neighbor classifier is slow: ◮

◮

◮

To predict the class of a new point x, the distance d(x, xi ) from x to each of the N training examples (x1 , y1 ), . . . , (xN , yN ) has to be computed. For a predictor space X := RM , each such computation needs O(M ) operations. We then keep track of the K points with the smallest distance.

In total one needs O(NM + NK ) operations. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 22 / 33 Machine Learning

Partial Distances / Lower Bounding In practice, nearest neighbor classifiers often can be accelerated by several methods. Partial distances: Compute the distance to each training point x ′ only partially, e.g., dr (x , x ′ ) := (

r X

1

′ 2 2, ) ) (xm − xm

m=1

r ≤M

As dr is non-decreasing in r , once dr (x , x ′ ) exceeds the K -th smallest distance computed so far, the training point x ′ can be dropped. This is a heuristic: it may accelerate computations, but it also may slow it down (as there are additional comparisons of the partial distances with the K smallest distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 23 / 33

Machine Learning

Nearest Neighbor Classification Algorithm 1 2 3 4 5 6 7

predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 24 / 33 Machine Learning 1 2 3 4 5 6 7

1 2 3 4

predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ predict-knn-class(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): C := π1 (argclos-k(q, x1 , x2 , . . . , xN , K )) PK yˆ := K1 k=1 yCk return yˆ

5 6 7 8 9 10 11

argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) return {(Ck , DCk ) | k = 1 : |C |} Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 24 / 33

Machine Learning

Find Neighbors / Without Lower Bounding

1 2 3 4 5 6 7 8 9 10

argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array T of size K for pairs N × R for n = 1 : min(K , N ): PM d := m=1 (qm − xn,m )2 insert-bottomk(T , (n, d ), π2 , 1) for n = K + 1 : N : PM d := m=1 (qm − xn,m )2 if d < π2 (TK ): insert-bottomk(T , (n, d ), π2 , 0) return T

Note: argclos-K returns the K points closest to q and their distances. π2 (n, d) := d comparison by second component (distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 25 / 33 Machine Learning

Find Neighbors / With Lower Bounding 1 2 3 4 5 6 7 8 9 10 11 12 13 14

argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array T of size K for pairs N × R for n = 1 : min(K , N ): PM d := m=1 (qm − xn,m )2 insert-bottomk(T , (n, d ), π2 , 1) for n = K + 1 : N : d := 0 m := 1 while m ≤ M and d < π2 (TK ): d := d + (qm − xn,m )2 m := m + 1 if d < π2 (TK ): insert-bottomk(T , (n, d), π2 , 0) return T Note: argclos-K returns the K points closest to q and their distances. π2 (n, d) := d comparison by second component (distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 25 / 33

Machine Learning

Search trees Search trees: Do not compute the distance of a new point x to...