Title | Ml 06 B1 nearest neighbor handout 2up |
---|---|
Author | maryam bahrami |
Course | Machine Learning |
Institution | Universität Hildesheim |
Pages | 24 |
File Size | 862.4 KB |
File Type | |
Total Downloads | 81 |
Total Views | 118 |
KNN, Nearest Neighbor Algorithm. Distance Measures...
Machine Learning
Syllabus Fri. 27.10. Fri. Fri. Fri. Fri. Fri. Fri. Fri. Fri. Fri.
3.11. 10.11. 17.11. 24.11. 1.12. 8.12. 15.12. 12.1. 19.1.
Fri. 26.1. Fri. 2.2. Fri. 9.2.
(1)
0. Introduction
(2) (3) (4) (5)
A. Supervised Learning: Linear Models & Fundamentals A.1 Linear Regression A.2 Linear Classification A.3 Regularization A.4 High-dimensional Data
(6) (7) (8) (9) (10)
B. Supervised Learning: Nonlinear Models B.1 Nearest-Neighbor Models B.2 Neural Networks B.3 Decision Trees B.4 Support Vector Machines B.5 A First Look at Bayesian and Markov Networks
(11) (12) (13)
C. Unsupervised Learning C.1 Clustering C.2 Dimensionality Reduction C.3 Frequent Pattern Mining
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33 Machine Learning
Outline
1. Distance Measures
2. K -Nearest Neighbor Models
3. Scalable Nearest Neighbor
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33
Machine Learning
Outline
1. Distance Measures
2. K -Nearest Neighbor Models
3. Scalable Nearest Neighbor
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33 Machine Learning
Motivation So far, regression and classification methods covered in the lecture can be used for ◮ numerical variables, ◮
binary variables (re-interpreted as numerical), and
nominal variables (coded as set of binary indicator variables). often called scalar variables. ◮
Often one is also interested in more complex variables such as ◮ set-valued variables, ◮ ◮
sequence-valued variables (e.g., strings), ...
often called structured variables or complex variables. Note: A complex variable in this sense has nothing to do with complex numbers. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 1 / 33
Machine Learning
Motivation There are two kinds of approaches to deal with complex variables: I. feature extraction 1. derive binary or numerical variables, 2. then use standard methods on the feature vectors.
II. kernel methods 1. establish a distance measure between two values, 2. then use methods that use only distances between objects (but no feature vectors).
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 2 / 33 Machine Learning
Distance measures Let d be a distance measure (also called metric) on a set X , i.e., d : X × X → R+ 0 with 1. d is positiv definite: d(x, y ) ≥ 0 and d(x, y ) = 0 ⇔ x = y 2. d is symmetric: d(x, y ) = d(y , x ) 3. d is subadditive: d(x, z) ≤ d(x, y ) + d(y , z ) (triangle inequality) (for all x, y , z ∈ X .) Example: Euclidean metric on X := Rn : n 1
d(x, y ) := ( (xi − yi )2 ) 2 X i=1
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 3 / 33
Machine Learning
Minkowski Metric / Lp metric
Minkowski Metric / Lp metric on X := Rn : with p ∈ R, p ≥ 1.
n X 1 d(x, y ) := ( p p |xi − yi | ) i=1
p = 1 (taxicab distance; Manhattan distance): d(x, y ) :=
n X i=1
|xi − yi |
p = 2 (euclidean distance): n X 1 d(x, y ) := ( (xi − yi )2 ) 2 i=1
p = ∞ (maximum distance; Chebyshev distance): n
d(x, y ) := max |xi − yi | i=1
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 4 / 33 Machine Learning
Minkowski Metric / Lp metric / Example Example:
1 x := 3 , 4
2 y := 4 1
dL1 (x, y ) =|1 − 2| + |3 − 4| + |4 − 1| = 1 + 1 + 3 = 5 dL2 (x, y ) =
q
(1 − 2)2 + (3 − 4)2 + (4 − 1)2 =
√ √ 1 + 1 + 9 = 11 ≈ 3.32
dL∞ (x, y ) = max{|1 − 2|, |3 − 4|, |4 − 1|} = max{1, 1, 3} = 3
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 5 / 33
Machine Learning
Similarity measures Instead of a distance measure sometimes similarity measures are used, i.e., sim : X × X → R0+ with ◮
sim is symmetric: sim(x, y ) = sim(y , x).
Some similarity measures have stronger properties: ◮ ◮
sim is discerning: sim(x, y ) ≤ 1 and sim(x, y ) = 1 ⇔ x = y sim(x, z) ≥ sim(x, y ) + sim(y , z) − 1.
Some similarity measures have values in [−1, 1] or even R where negative values denote “dissimilarity”. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 6 / 33 Machine Learning
Distance vs. Similarity measures
A discerning similarity measure can be turned into a semi-metric (pos. def. & symmetric, but not necessarily subadditive) via d(x, y ) := 1 − sim(x, y ) In the same way, a metric can be turned into a discerning similarity measure (with values possibly in ] − ∞, 1]).
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 7 / 33
Machine Learning
Cosine Similarity
The angle between two vectors in RN can be used as distance measure hx, y i ) d(x , y ) := angle(x , y ) := arccos( ||x ||2 ||y ||2 To avoid the arccos, often the cosine of the angle is used as similarity measure (cosine similarity): sim(x, y ) := cos angle(x, y ) :=
hx, y i ||x ||2 ||y ||2
Example:
1 x := 3 , 4
2 y := 4 1
18 1·2+3·4+4·1 √ = √ √ ≈ 0.77 sim(x, y ) = √ 1 + 9 + 16 4 + 16 + 1 26 21 Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
b
8 / 33
Machine Learning
Distances for Nominal Variables 1. Binary variables: ◮ there is only one reasonable distance measure: d(x, y ) := 1 − I(x = y ) ◮
with I(x = y ) :=
1 if x = y 0 otherwise
This coincides with ◮
L∞ , 21 L1 and
√1 L2 2
distance
for the indicator/dummy variables. 2. Nominal variables (with more than two possible values): ◮ The same distance measure is useful. 3. Hierarchical variables (i.e., a nominal variable with levels arranged in a hierarchy) ◮ there are more advanced distance measures (not covered here). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 9 / 33
Machine Learning
Distances for Set-valued Variables For set-valued variables (which values are subsets of a set A) the Hamming distance often is used: d(x , y ) := |(x \ y ) ∪ (y \ x)| = |{a ∈ A | I(a ∈ x ) 6= I(a ∈ y )}| (= the number of elements contained in only one of the two sets). Example: d({a, e, p, l}, {a, b, n}) = 5,
d({a, e , p, l}, {a, e , g , n, o, r }) = 6
Also often used is the similarity measure Jaccard coefficient: sim(x, y ) := Example: sim({a, e, p, l}, {a, b, n}) =
1 , 6
|x ∩ y | |x ∪ y |
sim({a, e, p, l}, {a, e, g , n, o, r }) =
2 8
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 10 / 33 Machine Learning
Distances for Strings / Sequences edit distance / Levenshtein distance: d(x, y ) := minimal number of single character deletions, insertions or substitutions to transform x in y Examples: d(man, men) = d(house, spouse) = d(order, express order) =
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 11 / 33
Machine Learning
Distances for Strings / Sequences edit distance / Levenshtein distance: d(x, y ) := minimal number of single character deletions, insertions or substitutions to transform x in y Examples: d(man, men) =1 d(house, spouse) =2 d(order, express order) =8
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 11 / 33 Machine Learning
Distances for Strings / Sequences The edit distance is computed recursively. With x1:i := (xi ′ )i ′ =1,...,i = (x1 , x2 , . . . , xi ),
i ∈N
we compute the number of operations to transform x1:i into y1:j as c(x1:i , y1:j ) := min{ c(x1:i −1 , y1:j ) + 1, // delete xi , x1:i −1 y1:j c(x1:i , y1:j−1 ) + 1, // x1:i y1:j−1 , insert yj c(x1:i −1 , y1:j−1 ) + I (xi 6= yj )} // x1:i −1 y1:j−1 , substitute yj for xi starting from c(x1:0 , y1:j ) = c(∅, y1:j ) := j c(x1:i , y1:0 ) = c(x1:i , ∅) := i
// insert y1 , . . . , yj // delete x1 , . . . , xi
Such a recursive computing scheme is called dynamic programming. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 12 / 33
Machine Learning
Distances for Strings / Sequences Example: compute d(excused, exhausted). 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 y [j]/x[i] e x c u s e d d e t s u a h x e
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33 Machine Learning
Distances for Strings / Sequences Example: compute d(excused, exhausted). d e t s u a h x e y [j]/x[i]
9 8 7 6 5 4 3 2 1 0
8 7 6 5 4 3 2 1 0 1 e
7 6 5 4 3 2 1 0 1 2 x
7 6 5 4 3 2 1 1 2 3 c
6 5 4 3 2 2 2 2 3 4 u
5 4 3 2 3 3 3 3 4 5 s
4 3 3 3 4 4 4 4 5 6 e
3 4 4 4 5 5 5 5 6 7 d
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33
Machine Learning
Distances for Strings / Sequences Example: compute d(excused, exhausted). d e t s u a h x e
9 8 7 6 5 4 3 2 1 0
y [j]/x[i]
8 7 6 5 4 3 2 1 0 1 e
7 6 5 4 3 2 1 0 1 2 x
7 6 5 4 3 2 1 1 2 3 c
6 5 4 3 2 2 2 2 3 4 u
5 4 3 2 3 3 3 3 4 5 s
4 3 3 3 4 4 4 4 5 6 e
3 4 4 4 5 5 5 5 6 7 d
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 13 / 33 Machine Learning
Outline
1. Distance Measures
2. K -Nearest Neighbor Models
3. Scalable Nearest Neighbor
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 14 / 33
Machine Learning
Neighborhoods Let d be a distance measure. For a dataset D ⊆X ×Y and x ∈ X let
D = {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}
be an enumeration with increasing distance to x, i.e., d(x, xn ) ≤ d(x, xn+1 ),
n = 1, . . . , N
(ties broken arbitrarily). The first K ∈ N points of such an enumeration, i.e., CK (x) := {(x1 , y1 ), (x2 , y2 ), . . . (xK , yK )} are called a K -neighborhood of x (in D). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 14 / 33 Machine Learning
Nearest Neighbor Regression and Classification Models The K -nearest neighbor regressor 1 yˆ (x) := K
X
y′
X
I(y = y ′ )
(x ′ ,y ′ )∈CK (x)
The K -nearest neighbor classifier 1 pˆ(Y = y | x) := K
(x ′ ,y ′ )∈CK (x)
and then predict the class with maximal predicted probability yˆ (x) := arg max pˆ(Y = y | x) y ∈Y
i.e., the majority class in the neighborhood. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 15 / 33
Machine Learning
Nearest Neighbor Regression Algorithm
1 2 3 4 5 6 7
predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 16 / 33 Machine Learning
Nearest Neighbor Classification Algorithm
1 2 3 4 5 6 7 8 9
predict-knn-class(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × Y, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) allocate array pˆ of size |Y| for k := 1 : K : pˆCk := pˆCk + 1/K return pˆ
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 17 / 33
Machine Learning
Compute the argmin 1 2 3 4 5 6 7 8
argmin-k(x ∈ RN , K ∈ N) : allocate array T of size K for n = 1 : min(K , N ): insert-bottomk(T1:n , n, πx , 1) for n = K + 1 : N : if xn < xTK : insert-bottomk(T , n, πx , 0) return T
9 10 11 12 13 14
insert-bottomk(T ∈ X K , n ∈ X , π : X → R, s ∈ N) : k := find-sorted(T1:K −s , n, π) for l := K : k + 1 decreasing: Tl := Tl−1 Tk+1 := n Note: πx (n) := xn comparison by x-values. Here, X := N. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 18 / 33 Machine Learning
Compute the argmin / find (naive)
1 2 3 4 5
find-sorted-linear(x ∈ X K , z ∈ X , π : X → R) : k := K while k > 0 and π(z) < π(xk ): k := k − 1 return k
◮
requires ◮
◮
x is sorted (increasingly w.r.t. π )
returns smallest index k with π(xk ) ≤ π(z) ◮
0, if π(z) < π(x1 )
Note: Esp. for larger K it is better to use binary search. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 19 / 33
Machine Learning
Decision Boundaries For 1-nearest neighbor, the predictor space is partitioned in regions of points that are closest to a given data point: regionD (x1 ), regionD (x2 ), . . . , regionD (xN ) with regionD (x) := {x ′ ∈ X | d(x ′ , x ) ≤ d(x ′ , x ′′ )
∀(x ′′ , y ′′ ) ∈ D}
These regions often are called cells, the whole partition a Voronoi tesselation.
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 20 / 33 Machine Learning
0.0
0.2
0.4
x2
0.6
0.8
1.0
Decision Boundaries
0.0
0.2
0.4
0.6
0.8
1.0
x1
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 21 / 33
Machine Learning
Decision
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 21 / 33 Machine Learning
Outline
1. Distance Measures
2. K -Nearest Neighbor Models
3. Scalable Nearest Neighbor
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 22 / 33
Machine Learning
Complexity of K -Nearest Neighbor Classifier
The K -Nearest Neighbor classifier does not need any learning algorithm ◮ as it just stores all the training examples. On the other hand, predicting using a K -nearest neighbor classifier is slow: ◮
◮
◮
To predict the class of a new point x, the distance d(x, xi ) from x to each of the N training examples (x1 , y1 ), . . . , (xN , yN ) has to be computed. For a predictor space X := RM , each such computation needs O(M ) operations. We then keep track of the K points with the smallest distance.
In total one needs O(NM + NK ) operations. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 22 / 33 Machine Learning
Partial Distances / Lower Bounding In practice, nearest neighbor classifiers often can be accelerated by several methods. Partial distances: Compute the distance to each training point x ′ only partially, e.g., dr (x , x ′ ) := (
r X
1
′ 2 2, ) ) (xm − xm
m=1
r ≤M
As dr is non-decreasing in r , once dr (x , x ′ ) exceeds the K -th smallest distance computed so far, the training point x ′ can be dropped. This is a heuristic: it may accelerate computations, but it also may slow it down (as there are additional comparisons of the partial distances with the K smallest distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 23 / 33
Machine Learning
Nearest Neighbor Classification Algorithm 1 2 3 4 5 6 7
predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 24 / 33 Machine Learning 1 2 3 4 5 6 7
1 2 3 4
predict-knn-reg(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) PK yˆ := K1 k=1 yCk return yˆ predict-knn-class(q ∈ RM , D train := {(x1 , y1 ), . . . , (xN , yN )} ∈ RM × R, K ∈ N, d): C := π1 (argclos-k(q, x1 , x2 , . . . , xN , K )) PK yˆ := K1 k=1 yCk return yˆ
5 6 7 8 9 10 11
argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array D of size N for n := 1 : N : Dn := d(q, xn ) C := argmin-k(D, K ) return {(Ck , DCk ) | k = 1 : |C |} Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 24 / 33
Machine Learning
Find Neighbors / Without Lower Bounding
1 2 3 4 5 6 7 8 9 10
argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array T of size K for pairs N × R for n = 1 : min(K , N ): PM d := m=1 (qm − xn,m )2 insert-bottomk(T , (n, d ), π2 , 1) for n = K + 1 : N : PM d := m=1 (qm − xn,m )2 if d < π2 (TK ): insert-bottomk(T , (n, d ), π2 , 0) return T
Note: argclos-K returns the K points closest to q and their distances. π2 (n, d) := d comparison by second component (distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 25 / 33 Machine Learning
Find Neighbors / With Lower Bounding 1 2 3 4 5 6 7 8 9 10 11 12 13 14
argclos-k(q ∈ RM , x1 , . . . , xN ∈ RM , K ∈ N) : allocate array T of size K for pairs N × R for n = 1 : min(K , N ): PM d := m=1 (qm − xn,m )2 insert-bottomk(T , (n, d ), π2 , 1) for n = K + 1 : N : d := 0 m := 1 while m ≤ M and d < π2 (TK ): d := d + (qm − xn,m )2 m := m + 1 if d < π2 (TK ): insert-bottomk(T , (n, d), π2 , 0) return T Note: argclos-K returns the K points closest to q and their distances. π2 (n, d) := d comparison by second component (distance). Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 25 / 33
Machine Learning
Search trees Search trees: Do not compute the distance of a new point x to...