Machine learning - Overfitting

Course: Introduction to Machine Learning
Institution: University of Greenwich

If a training example with an incorrect target classification is added, the result will be a tree that performs well on the (now errorful) training examples but less well on new, unseen instances. Adapting to this noisy training data is one type of overfitting. Overfitting can also occur when the number of training examples is too small to be representative of the true target function, so that coincidental regularities are picked up during training. More precisely, given a hypothesis space H, a hypothesis h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that h has a smaller error than h′ over the training data, but h′ has a smaller error over the entire distribution of instances.

Overfitting is a real problem for decision tree learning, with one empirical study showing a 10-25% decrease in accuracy over a range of tasks. Other machine learning methods also suffer from overfitting. There are two approaches to avoiding it: stop growing the tree before it perfectly fits the training data (e.g., when a data split is not statistically significant), or grow the full tree and then prune it afterwards. In practice, the second approach is more successful. With either approach, we have to find the optimal final tree size. Ways to do this include:

- using a set of examples distinct from the training examples to evaluate the quality of the tree;
- using all the data for training, but applying a statistical test to decide whether expanding or pruning a given node is likely to improve performance over the whole instance distribution;
- measuring the complexity of encoding the exceptional training examples and the decision tree, and stopping growth when this size is minimised (the minimum description length principle).

The first approach is the most common and is called the training and validation set approach. Available instances are divided into a training set (approximately two thirds of the data) and a validation set (commonly around one third), with the hope that random errors and coincidental regularities learned from the training set will not be present in the validation set.
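A minimal sketch of the training-and-validation-set split described above: the 2/3 : 1/3 proportions follow the notes, while the function name, the shuffle, and the (attributes, label) data layout are assumptions for illustration.

```python
import random

def train_validation_split(examples, train_fraction=2/3, seed=0):
    """Partition labelled examples into a training set (~2/3 of the data)
    and a validation set (~1/3), as in the training-and-validation-set approach."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # avoid any ordering effects in the data
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, validation set)

# Hypothetical usage with a list of (attribute-values, label) pairs:
# train, valid = train_validation_split(dataset)
```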

With this approach, assuming the data is split into training and validation sets, we train the decision tree on the training set and then, until further pruning is harmful, evaluate for each decision node the impact on validation-set accuracy of removing that node and those below it, and remove the node whose removal most improves accuracy on the validation set. The impact of removing a node is evaluated by replacing the subtree rooted at that decision node with a leaf node whose classification is the most common classification of the examples beneath the decision node, and then evaluating the resulting tree. To assess the value of reduced-error pruning, we can split the data into three distinct sets: the training examples for building the original tree, the validation examples for guiding the pruning, and further test examples to provide an estimate of accuracy over future unseen examples. However, holding data back for a validation set reduces the data available for training.

Rule post-pruning is perhaps the most frequently used method (e.g., in C4.5). It proceeds by converting the tree to an equivalent set of rules (by making the conjunction of decision node tests along each branch the antecedent of a rule, and each leaf the consequent), pruning each rule independently of the others, and then sorting the final rules into a desired sequence for use. To prune a rule, any precondition (a conjunct in its antecedent) is removed if doing so does not worsen the rule's accuracy. Rule accuracy can be estimated using a separate validation set, or using the training data with a statistically-based pessimistic estimate of accuracy (C4.5 does the latter). Converting trees to rules before pruning allows us to distinguish the different contexts in which a decision node is used, and therefore to treat each path through the tree differently (in contrast to removing a decision node, which removes all the paths below it). It also removes the distinction between tests near the root and tests near the leaves, avoiding the need to rearrange the tree should higher nodes be removed. Rules are also often easier for people to understand.
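A minimal sketch of the reduced-error pruning loop described above, under some assumptions not in the notes: each node records the (non-empty) set of training examples that reached it, examples are (attribute-values, label) pairs, and pruning continues while validation accuracy does not decrease. The class and function names are illustrative.

```python
from collections import Counter

class Node:
    """A decision tree node: a leaf with a label, or a test on `attribute`
    with one child subtree per attribute value."""
    def __init__(self, attribute=None, children=None, label=None, examples=None):
        self.attribute = attribute
        self.children = children or {}   # attribute value -> child Node
        self.label = label               # set only for leaf nodes
        self.examples = examples or []   # training examples that reached this node

def classify(node, example):
    while node.label is None:
        node = node.children[example[node.attribute]]
    return node.label

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def internal_nodes(node):
    if node.label is None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def majority_label(node):
    # Most common classification of the training examples beneath the node.
    return Counter(y for _, y in node.examples).most_common(1)[0][0]

def reduced_error_prune(tree, validation):
    """Repeatedly replace the decision node whose removal most helps the
    validation set with a majority-class leaf; stop when no removal helps."""
    while True:
        best, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            # Temporarily turn the node into a leaf and re-evaluate the tree.
            node.attribute, node.children, node.label = None, {}, majority_label(node)
            acc = accuracy(tree, validation)
            if acc >= best_acc:            # prune only if accuracy is no worse
                best, best_acc = node, acc
            node.attribute, node.children, node.label = saved   # undo the trial prune
        if best is None:
            return tree                    # further pruning would be harmful
        best.attribute, best.children, best.label = None, {}, majority_label(best)
```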

Continuous-valued attributes

The initial definition of ID3 was restricted to discrete-valued target attributes and discrete-valued decision node attributes. We can overcome the second limitation by dynamically defining new discrete-valued attributes that partition a continuous attribute's value into a set of discrete intervals, i.e., for a continuous attribute A we dynamically create a new boolean attribute Ac that is true if A > c and false otherwise, where the threshold c is chosen to maximise information gain. This can be extended to split continuous attributes into more than two intervals.

The information gain measure favours attributes with many values over those with few values - e.g., a date attribute would result in a tree of depth 1 that perfectly classifies the training examples but fails on all other data, because date perfectly predicts the target attribute for every training example. Other attribute selection measures can be used to avoid this. One such measure is gain ratio:

GainRatio(S,A) ≡ Gain(S,A) / SplitInformation(S,A)

where

SplitInformation(S,A) ≡ −Σi=1..c (|Si| / |S|) log2 (|Si| / |S|)

and Si is the subset of S for which the c-valued attribute A has the value vi. SplitInformation is the entropy of S with respect to the values of A, which has the effect of penalising attributes with many uniformly distributed values. Experiments with variants of this and other attribute selection techniques have been carried out and are reported in the machine learning literature.

Another refinement concerns missing values for some attribute in a training example x. Several ways of handling this have been explored. At the decision node n where Gain(S,A) is computed, we can either:

- assign the missing value the most common value of A among the other examples sorted to node n;
- assign it the most common value of A among the examples at node n with the same target attribute value as x;
- assign a probability pi to each possible value vi of A, estimated from the observed frequencies of the values of A among the examples sorted to node n; or
- distribute a fraction pi of example x down each branch in the tree below n (this is the technique used in C4.5).

The last technique can also be used to classify new examples with missing attributes (i.e., after learning) in the same fashion.
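As a small illustration of the gain ratio measure defined above, a sketch in Python; the entropy and information gain helpers and the (attribute-values, label) data layout are assumptions for illustration, not part of the notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of target labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def split_information(examples, attribute):
    """SplitInformation(S, A): entropy of S with respect to the values of A."""
    counts, total = Counter(x[attribute] for x, _ in examples), len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attribute):
    """Gain(S, A): expected reduction in entropy from partitioning S on A."""
    total, remainder = len(examples), 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        remainder += len(subset) / total * entropy([y for _, y in subset])
    return entropy([y for _, y in examples]) - remainder

def gain_ratio(examples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    split_info = split_information(examples, attribute)
    if split_info == 0:      # every example shares the same value of A
        return 0.0
    return information_gain(examples, attribute) / split_info
```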

Sometimes different attributes have different costs associated with acquiring their values (e.g., in medical diagnosis, where different attributes have different costs), so it is beneficial to learn a consistent tree with a low expected cost. Various approaches have been explored in which the attribute selection measure is modified to include a cost term, e.g., Tan and Schlimmer's Gain²(S,A) / Cost(A), or Nunez's (2^Gain(S,A) − 1) / (Cost(A) + 1)^w, where w ∈ [0,1] determines the importance of the cost…
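A small sketch of the two cost-sensitive selection measures quoted above; the gain and cost arguments are assumed to come from elsewhere (for instance, the information_gain helper in the earlier sketch and a hypothetical per-attribute cost table).

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer's cost-sensitive measure: Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """Nunez's measure: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w,
    where w in [0, 1] determines the importance of the cost."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Hypothetical usage: choose the attribute with the best cost-adjusted score.
# best = max(attributes, key=lambda a: nunez(information_gain(S, a), cost[a]))
```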

