
Journal of Machine Learning Research 1 (2000) 1-48

Submitted 4/00; Published 10/00

Survey of Classification Algorithms and Various Model Selection Methods

Vishal Sharma

[email protected]

Department of Physics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

Editor: Leslie Pack Kaelbling

Abstract

This report comprehensively describes the various types of classification algorithms that already exist. I will mainly discuss and compare in detail the seven major types of classification algorithms, the comparison being based essentially on the objective function, assumptions, advantages, and drawbacks of each.

Keywords: Naive Bayes Classifier, Support Vector Machines (SVM), Linear Classifier, k-Nearest Neighbours (k-NN), Artificial Neural Networks (ANNs), Quadratic Classifier

1. Introduction

Classification is an important tool for the analysis of statistical problems. In machine learning and statistics, classification refers to the problem of identifying whether an object belongs to a particular category, based on a previously learned model. This model is learned statistically from a set of training data whose categorization is predefined; this method is known as supervised learning. There are various classification techniques for learning such statistical models so as to obtain the best possible result on the test data. Each of these techniques takes a different approach to categorization and must be selected carefully according to the needs of the problem. In this paper, I discuss some of the most common classification algorithms, various scoring schemes to measure their performance, multi-class classification techniques, and lastly model selection techniques.

2. Survey of Various Classification Algorithms

2.1 Linear Classifiers

Linear classifiers are a sub-type of classification algorithms that use a linear combination of the characteristics of an object to decide which category to place the object into. In general, these characteristics are collected into a feature vector, which carries the information needed to assign the object in question to a class.
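As a minimal illustration of such a rule (my own sketch, not from the paper; the weights, bias, and feature values below are made-up numbers), a linear classifier scores an object with W^T x + b and thresholds the score:

import numpy as np

def linear_classify(x, w, b):
    # Assign class 1 if the linear score w.x + b is non-negative, else class 0.
    score = np.dot(w, x) + b
    return 1 if score >= 0 else 0

# Hypothetical 3-dimensional feature vector and weights, for illustration only.
w = np.array([0.5, -1.2, 0.8])
b = 0.1
x = np.array([1.0, 0.3, 2.0])
print(linear_classify(x, w, b))  # prints 1, since the score 1.84 is positive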



2.1.1 Logistic Regression

Logistic regression is a type of linear classifier. It is used to predict whether a given object lies in class '1' or class '0'; this setting, with a binary dependent variable, is the most common problem for a logistic regression classifier. Logistic regression can also be used when the dependent variable takes more than two values, in which case we use multinomial logistic regression. The basis of classification in logistic regression is the logistic function, which creates a probability from the weighted feature vector. The logistic function is:

h_W(X) = \frac{1}{1 + e^{-W^T X}}    (1)

In the above equation, X is the feature vector, whose components are the independent characteristics of the object, augmented with a constant for the bias. The total input into the logistic function can thus be written mathematically as:

z = W^T X    (2)

where W is the vector of associated weights, inclusive of the bias term, and X is the n-dimensional feature vector augmented with a constant 1:

W = (w_n, w_{n-1}, \ldots, w_2, w_1, b)    (3)

X = (x_n, x_{n-1}, \ldots, x_2, x_1, 1)    (4)

For logistic regression, our objective function is the negative log likelihood (equivalently, we maximize the log likelihood). This is the function that must be optimized in order to learn the decision boundary associated with the classification. The likelihood function is defined as follows:

L(W, x) = \Pr(Y | X; W)    (5)

which is the product of the individual probabilities, since we assume the observations to be independent of each other. The actual objective function, the negative log likelihood, is defined as follows:

-\log L(W | x) = -\sum_{i=1}^{N} \log \Pr(y_i | x_i; W)    (6)

-\log L(W | x) = -\sum_{i=1}^{N} \left[ y_i \log h_W(x_i) + (1 - y_i) \log(1 - h_W(x_i)) \right]    (7)

Learning in logistic regression corresponds to minimization of the above-mentioned function, which is also known as the cost function; this is typically done with gradient descent, as in the sketch below.
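As a rough illustration (a sketch of my own, not the author's implementation), the following NumPy code minimizes the negative log likelihood of equations (6)–(7) by batch gradient descent on a tiny made-up dataset; the learning rate and iteration count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(W, X, y):
    # Equation (7): the cost being minimized.
    p = sigmoid(X @ W)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, lr=0.1, iters=1000):
    # Batch gradient descent; X already carries a bias column of ones.
    W = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ W) - y)  # gradient of the negative log likelihood
        W -= lr * grad
    return W

# Made-up 1-D example: points above about 2.5 belong to class 1.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])  # last column is the bias term
y = np.array([0.0, 0.0, 1.0, 1.0])
W = fit_logistic(X, y)
print(sigmoid(X @ W))               # predicted probabilities of class 1
print(neg_log_likelihood(W, X, y))  # the cost after training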


Again, as already mentioned, the algorithm leverages certain assumptions for ease of implementation. These include:

• The observations are assumed to be independent of each other. This enables us to write the probability of the measurements as a product of individual probabilities in the likelihood function.
• The correlation amongst the independent variables should be fairly low.
• The relationship between the log odds and the independent variables is required to be linear.

The main advantages and disadvantages of logistic regression are the following: it does not assume a linear relationship between the independent and the dependent variables; neither the independent nor the dependent variables need be normally distributed; and handling non-linear effects is possible. On the other hand, it requires much more training data to achieve meaningful results.

2.1.2 Naive Bayes Classifier

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem with a strong independence assumption between the features. They are a popular method in practice for text categorization, for instance the classification of spam versus non-spam emails with word frequencies as the feature vectors. Despite their oversimplified design they find good application in many real-world problems. Naive Bayes is a conditional probability model that assigns a class C_k to a feature vector x with the probability:

\Pr(C_k | x_1, x_2, \ldots, x_n)    (8)

It is clear that this type of model calculates the conditional probability of each class k from the full feature vector; if the feature vector is high-dimensional, modelling this directly is infeasible. Therefore, we remodel the probability given in equation (8) as:

\Pr(C_k | x) = \frac{\Pr(C_k)\,\Pr(x | C_k)}{\Pr(x)}    (9)

Since the denominator does not depend on C and the values of the feature vector are given, it is a constant, and we can focus on the numerator. The numerator is the joint probability, which can be expanded using the chain rule:

\Pr(C_k, x_1, x_2, \ldots, x_n) = \Pr(x_1, x_2, \ldots, x_n, C_k)    (10)

\Pr(C_k, x_1, \ldots, x_n) = \Pr(x_1 | x_2, \ldots, x_n, C_k)\,\Pr(x_2 | x_3, \ldots, x_n, C_k) \cdots \Pr(x_{n-1} | x_n, C_k)\,\Pr(x_n | C_k)\,\Pr(C_k)    (11)

With the naive independence assumption, the model can finally be expressed as:

\Pr(C_k | x_1, \ldots, x_n) \propto \Pr(C_k) \prod_{i=1}^{n} \Pr(x_i | C_k)    (12)
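As a hedged illustration of equation (12) (my own toy example, not from the paper), the sketch below classifies with a multinomial-style Naive Bayes in log space, using made-up word counts in the spirit of the spam example above; the counts and the smoothing constant alpha are arbitrary:

import numpy as np

# Toy word-count matrix: rows are documents, columns are words in a tiny vocabulary.
X = np.array([[3, 0, 1],   # documents labelled spam are heavy on word 0
              [4, 1, 0],
              [0, 3, 2],   # non-spam documents are heavy on words 1 and 2
              [1, 4, 3]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam

def fit_naive_bayes(X, y, alpha=1.0):
    # Estimate log Pr(C_k) and log Pr(x_i | C_k) with Laplace smoothing.
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_lik

def predict(x, classes, log_prior, log_lik):
    # Equation (12) in log space: argmax_k [log Pr(C_k) + sum_i x_i log Pr(word_i | C_k)].
    scores = log_prior + log_lik @ x
    return classes[np.argmax(scores)]

classes, log_prior, log_lik = fit_naive_bayes(X, y)
print(predict(np.array([2, 0, 1]), classes, log_prior, log_lik))  # prints 1 (spam)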


The objective function here is the posterior probability, \Pr(C_k | x), which is maximized. The class-conditional likelihood \Pr(x | C_k) is composed first; the Bayes equation above is then used to calculate the posterior probability, and the class with the highest posterior is the prediction. The technique assumes independence among the feature variables, which can be considered a disadvantage: on real data the independence assumption can break down and yield meaningless or untrustworthy results. However, the classifier is simple and fast and can perform well in multi-class classification; when the assumption of independence does hold within the data, it performs better than other classification algorithms and needs less training data.

2.2 Quadratic Classifier

As the name suggests, a quadratic classifier learns a more general, more complex decision boundary for separating two or more classes. It is a generalisation of the linear classifier: the correct solution is assumed to be quadratic in nature, with an output y that depends on the term:

x^T A x + W^T x + b    (13)

The technique used for calculating this more complex quadratic decision boundary is based on Quadratic Discriminant Analysis (QDA). The learning rule for this type of algorithm is fairly simple. We model a quadratic discriminant function, given as follows, which needs to be maximized:

\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k    (14)

In the above equation, \Sigma_k is the covariance matrix of class k, and it is not assumed identical across classes. Learning is based on the optimization of this function; that is, the classification rule can be represented mathematically as:

G(x) = \arg\max_k \delta_k(x)    (15)

This same function is the objective function and needs to be maximized; classification amounts to determining the class k for which \delta_k(x) is largest. The advantages of QDA are that it allows more flexibility and tends to fit the data well. On the other hand, it requires more defining parameters: since there is a distinct covariance matrix for every class, the problem can become infeasible in a situation with many classes but not many data points. QDA is useful where the training data set is big enough that reducing the variance is not important. The classifier assumes that the observations within each class are distributed as a Gaussian; one example of this kind is Gaussian Discriminant Analysis (GDA). A minimal sketch of the rule in equations (14)–(15) follows.
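The NumPy sketch below is a rough rendering of equations (14)–(15), assuming made-up Gaussian data; it estimates a mean, covariance matrix, and prior per class and classifies by the largest discriminant. It is my own illustration, not the author's implementation:

import numpy as np

def fit_qda(X, y):
    # Per-class mean, covariance, and prior, as required by equation (14).
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def discriminant(x, mu, sigma, pi):
    # Equation (14): delta_k(x).
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(sigma))
            - 0.5 * d @ np.linalg.inv(sigma) @ d
            + np.log(pi))

def classify(x, params):
    # Equation (15): pick the class with the largest discriminant.
    return max(params, key=lambda c: discriminant(x, *params[c]))

# Two made-up 2-D Gaussian blobs with different covariances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 0.5], (50, 2)),
               rng.normal([3.0, 3.0], [0.5, 1.5], (50, 2))])
y = np.repeat([0, 1], 50)
params = fit_qda(X, y)
print(classify(np.array([2.5, 2.5]), params))  # expected to print 1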


2.3 Artificial Neural Networks (ANNs)

Artificial Neural Networks are another kind of approach to solving machine learning problems, akin to the way a human brain processes information. Their structure consists of artificial neurons and the corresponding weights that serve as the connections between neurons. The fundamental working principle of ANNs is to learn the relationship between the dependent and the independent variables by adjusting the weights between the various neurons. The objective function in the case of a neural network is the sum-of-squares error function, which tells the network how far its output diverges from the expected result. This error information is then used to remodel the weights in a manner that further reduces the error function; the aim of the learning process is to evaluate the error function at each iteration and readjust the weights in order to attain a local minimum. A layer in a neural network is defined as a set of neurons and their corresponding weights, and every neural network has three types of layers: the input layer, the hidden layers, and the output layer. A neuron first computes the weighted input W^T x and then applies an activation function to it to determine its output. Various activation functions are used, for instance the ReLU function, the signum function, and the most common one, the sigmoid function:

\frac{1}{1 + e^{-x}}

The most common type of neural network is the feed-forward neural network, in which the connections only propagate forward, existing between the neurons of layer n and layer n+1 with no backward connections. For learning, the error function is calculated for the current weights, and the weights are then readjusted in order to decrease the error, moving towards the minimum; the backpropagation algorithm is used to implement this error correction. The error for a feed-forward neural network forms the objective function that is minimized using the principle of gradient descent. Given below is the sum-of-squares error function:

E^n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2    (16)

This error function satisfies the criterion for Lyapunov stability and hence is called a Lyapunov function. It is easy to optimize using a gradient descent technique for the adjustment of the weights. The various advantages and disadvantages of neural networks are as follows. ANNs are easy to use; they can provide a good fit for any function regardless of its non-linearity; they perform well even when the input data is noisy; and they find their best use in complex problems, for example image recognition. On the other hand, ANNs are often used in places where simple linear regression could be implemented; they require a great amount of data to be trained sufficiently well; they are essentially a black box that does not allow details to be studied; and to increase accuracy by a few percent, the size of the network may need to be scaled up greatly. A small training sketch follows.
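As a from-scratch sketch (not the author's code), the following trains a tiny one-hidden-layer network on the classic XOR problem by backpropagating the sum-of-squares error of equation (16) with plain gradient descent; the layer sizes, learning rate, iteration count, and random seed are arbitrary choices, and convergence is not guaranteed for every initialization:

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(A):
    # Append a constant-1 column so the weight matrices absorb the biases.
    return np.hstack([A, np.ones((A.shape[0], 1))])

# XOR task: not linearly separable, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (3, 4))  # input (2 features + bias) -> 4 hidden units
W2 = rng.normal(0, 1, (5, 1))  # hidden (4 units + bias) -> 1 output
lr = 0.5

for _ in range(10000):
    # Forward pass.
    h = sigmoid(add_bias(X) @ W1)
    y = sigmoid(add_bias(h) @ W2)
    # Backpropagate the sum-of-squares error of equation (16).
    delta2 = (y - t) * y * (1 - y)               # output-layer error signal
    delta1 = (delta2 @ W2[:-1].T) * h * (1 - h)  # hidden-layer error signal (bias row dropped)
    W2 -= lr * add_bias(h).T @ delta2
    W1 -= lr * add_bias(X).T @ delta1

print(y.round(2))  # typically approaches [[0], [1], [1], [0]], depending on the initialization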


2.4 Support Vector Machine (SVM)

Support vector machines are a set of supervised learning models. The model is trained with a set of training data points that belong to either of the two classes of a binary problem; the SVM then applies the learned model to new data points from the test sample and places them into one of the two classes. The SVM is thus a non-probabilistic binary classifier. The idea behind the classification is to use an (n-1)-dimensional hyperplane to linearly separate n-dimensional feature vectors into two classes. Since there can be infinitely many such hyperplanes that separate the data points (given they are linearly separable), we choose the hyperplane, or linear classifier, that has the maximum margin. We define this margin as the distance from the hyperplane to the nearest data points of each class; this distance is the objective function, and the aim of the optimization is to maximize it. SVMs possess a regularization parameter, which keeps the optimization algorithm from overfitting the data. They can also fit the data non-linearly using a technique called the kernel trick, and they are based on a convex optimization problem, for which very efficient solution methods exist. Their disadvantage is that the learned parameters correspond to one particular choice of regularization value and kernel.

2.5 k-Nearest Neighbours (k-NN)

The k-Nearest Neighbours algorithm is an instance-based classification technique, also referred to as lazy learning. The basic idea of this type of classification is very simple, although the learned model depends on k: in essence, a different model is learned for each value of k. The algorithm assigns a class to a data point based on the classes of its k nearest neighbours, picking the class that is most common among those k neighbours. When k = 1 the problem becomes trivial and classification is based solely on the nearest neighbour: a new data point is placed in the class to which its single nearest neighbour belongs. A minimal sketch of both classifiers follows.
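As a rough usage sketch (assuming the scikit-learn library is available; the dataset, the RBF kernel, and the hyperparameter values C and k are arbitrary illustrations, not the author's choices), both classifiers can be tried side by side:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Two made-up 2-D clusters for a binary problem.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

# SVM with an RBF kernel; C is the regularization parameter discussed above.
svm = SVC(kernel="rbf", C=1.0).fit(X, y)

# k-NN with k = 5; a different k would in essence learn a different model.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

test = np.array([[2.5, 2.5]])
print(svm.predict(test), knn.predict(test))  # both will most likely predict class 1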

3. Summary of Various Scoring Methods for Classification

3.1 Accuracy

This is the most common scoring method, and essentially the most misused one; it is viable only for certain types of classification problems. It is calculated as the number of correct predictions over the total number of predictions (where correct predictions include both correctly classifying an observation into a class and correctly keeping it out of one). It is appropriate when the problem has an equal number of observations in each class and when correctly predicting membership and correctly predicting non-membership are equally important. Arguably, classification problems of this nature do exist, but for other types of problems it is not a good idea to use this metric for performance evaluation.


To make this clearer, suppose an example problem in the medical diagnosis of a disease X, and assume that about 10% of the population suffers from this disease. We design a faulty medical device for diagnosing this disease in a random person from this population. Our device, by virtue of being faulty, outputs negative for everyone irrespective of whether they have the condition or not. In such a scenario, if we choose accuracy as our scoring method, we get an accuracy of 90% over the given test population, even though our device failed miserably and did not diagnose the 10% of people actually suffering from the disease. Mathematically, accuracy can be presented as:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (17)

where TP/TN are true positives/negatives, FP/FN are false positives/negatives, and total predictions = TP + TN + FP + FN.

3.2 Precision

Precision is defined as the correctly predicted classifications over the total predicted classifications. High precision means the model rarely places data points into a class they do not belong to, i.e., it makes few false positives:

\text{Precision} = \frac{TP}{TP + FP}    (18)

with the abbreviations having their usual meaning.

3.3 Recall

Recall is defined as the correctly predicted classifications over all actual members of the class in question. It tells us how many objects that truly belong to that class are missed, i.e., left unclassified or classified outside the class:

\text{Recall} = \frac{TP}{TP + FN}    (19)

with the abbreviations having their usual meaning.

3.4 F1-Score

The F1-score is another metric, and one that is at first sight harder to grasp intuitively. It brings in both the false positives and the false negatives to weigh the error in decision making, and is defined as the harmonic mean of precision and recall. Ideally, we would want to capture all the true positive observations that exist for a particular class while being careful to admit none that do not belong to it; if we could do that, we would have both high precision and high recall, and consequently a high F1-score for our model. An important point is that even if the precision is remarkably high, a low recall will dominate and necessarily bring the F1-score down, and vice versa:

\frac{2}{F_1} = \frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}    (20)


F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (21)
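To tie these formulas together, here is a small sketch of my own that computes equations (17)–(21) from raw confusion-matrix counts, applied to the faulty-device example above (100 hypothetical patients, 10 of them sick, and a device that predicts negative for everyone):

def scores(tp, tn, fp, fn):
    # Equations (17)-(21) from raw confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Faulty device: 90 healthy people correctly called negative (TN),
# 10 sick people wrongly called negative (FN), and nothing predicted positive.
print(scores(tp=0, tn=90, fp=0, fn=10))  # (0.9, 0.0, 0.0, 0.0)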

3.5 Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC)

The Receiver Operating Characteristic (ROC) is a statistical curve: a graphical plot of a model's classification quality as its discrimination threshold is varied. For instance, for mapping a probabilistic output to either Class 1 or Class 0, we may consider the rule that if the probability of being in Class 1 is greater than 0.5 then we classify into Class 1, otherwise if it is less than ...
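As a rough illustration of how such a threshold sweep is traced and summarized (assuming scikit-learn is available; the labels and predicted probabilities below are made up), the ROC curve and its area can be computed as:

from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of Class 1.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_prob))              # area under that curve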

