Exam 2017
Course: CS3440 Data Mining
Institution: Aston University

Show your working in all calculations.

1. a) Classify the following attributes as binary, discrete, or continuous. Also classify them as nominal, ordinal, interval or ratio. Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. Example: Age in years. Answer: discrete, ratio.

i) Time in terms of AM or PM.

ii) Brightness as measured by a light meter.

iii) Bronze, Silver, and Gold medals as awarded at the Olympics.

iv) Number of patients in a hospital.

v) Distance from the centre of campus.

(10 marks)

Model Answer:

(2 marks) for each sub-question.

i) Binary, ordinal

ii) Continuous, ratio

iii) Discrete, ordinal

iv) Discrete, ratio

v) Continuous, interval/ratio (depends on the interpretation)


b) “There is no need to test a classifier. Computing the error rate on the training data gives us all the information that we need to know.” Comment on the truth of this statement using at least TWO examples from your experience of the coursework or labs on this module. (8 marks)

Model Answer:


The training-set error rate is optimistically biased: it underestimates the true error rate. Testing on held-out data is therefore essential if we want to use the classifier on new data and need an estimate of its likely performance. The quality of that estimate depends on the amount of test data available. Examples: a nearest-neighbour classifier always scores 100% on its own training data, but this is not a realistic estimate of the true error rate. Most models perform better on the training set than on the test set; if they have been trained carefully, this difference is minimised (e.g. naive Bayes on the iris and soybean datasets). ((4 marks) for the initial discussion.

(2 marks) for each example.)


c) Explain, using a diagram, how 10-fold cross-validation is carried out. When is cross-validation a useful technique? (7 marks)

Model Answer:

Divide the dataset into ten disjoint segments of (roughly) equal size. Over ten iterations, each segment is chosen in turn to be the test set, with the remaining nine segments used for training. (2 marks) The evaluation measure is computed for each test set and the average over all ten segments is used as the estimate of the true value. (2 marks)

(1 mark) for the figure.

Cross-validation is useful when the amount of training data is limited, so that a standard holdout dataset would yield broad confidence intervals for the error rate estimate. (2 marks)
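The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the exam: a trivial majority-class "classifier" stands in for a real model, and all names (`k_fold_indices`, `cross_validate`) are illustrative.

```python
# Minimal sketch of 10-fold cross-validation on a toy dataset.
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k (roughly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k=10):
    """Each fold is the test set once; the other k-1 folds form the training set."""
    folds = k_fold_indices(len(X), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Trivial stand-in classifier: predict the majority class of the training data.
        train_labels = [y[j] for j in train_idx]
        majority_class = max(set(train_labels), key=train_labels.count)
        correct = sum(1 for j in test_idx if y[j] == majority_class)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k  # average over the ten test segments

# Toy dataset: 100 objects, 60% of one class.
X = list(range(100))
y = [1] * 60 + [0] * 40
print(round(cross_validate(X, y), 2))
```

Any real learner would slot in where the majority-class rule is used; the fold bookkeeping is the part the question is about.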


2. You are doing research on some unknown fruit species. In the samples given to you, some have been identified as poisonous, and others as not. You have the data as shown in the table below to consider whether the fruits A through H are poisonous. Answer the following questions.

Sample  IsHeavy  IsSmelly  IsSpotted  IsSmooth  IsPoisonous
A       0        0         0          0         0
B       0        0         1          0         0
C       1        1         0          1         0
D       1        0         0          1         1
E       0        1         1          0         1
F       0        0         1          1         1
G       0        0         0          1         1
H       1        1         0          0         1

a) What is the entropy of “IsPoisonous”?

(5 marks)

Model Answer:

The equation for calculating entropy is Entropy(t) = −Σ_{j=1}^{n} p_j log2(p_j). (2 marks)

Entropy(IsPoisonous) = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.9544 (3 marks)
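The arithmetic can be checked with a short Python sketch (the `entropy` helper is illustrative, not part of the exam):

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p_j * log2(p_j), computed from class counts at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# IsPoisonous has 3 non-poisonous and 5 poisonous samples out of 8.
print(round(entropy([3, 5]), 4))  # → 0.9544
```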

b) Calculate the information gain of EACH of the TWO attributes (“IsHeavy” and “IsSmooth”) for splitting the class (the “IsPoisonous” attribute). Which of these two attributes should you choose as the root of a decision tree? Why? (14 marks) Model Answer: Quality of split at node p into k partitions (children) is:

GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) × Entropy(i)

(3 marks)

If split by “IsHeavy”, we will have:

IsHeavy  IsPoisonous = 0  IsPoisonous = 1
0        2                3
1        1                2

(2 marks)

For the branch “IsHeavy = 0”, Entropy = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971


For the branch “IsHeavy = 1”, Entropy = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183

GAIN_split = Entropy(IsPoisonous) − (5/8 × 0.971 + 3/8 × 0.9183) = 0.9544 − 0.9512 = 0.0032 (3 marks)

If split by “IsSmooth”, we will have:

IsSmooth  IsPoisonous = 0  IsPoisonous = 1
0         2                2
1         1                3

(2 marks)

For the branch “IsSmooth = 0”, Entropy = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1

For the branch “IsSmooth = 1”, Entropy = −(1/4) log2(1/4) − (3/4) log2(3/4) = 0.8113

GAIN_split = Entropy(IsPoisonous) − (4/8 × 1 + 4/8 × 0.8113) = 0.9544 − 0.9057 = 0.0487

(3 marks)

Since the information gain using “IsSmooth” is larger than using “IsHeavy”, the attribute “IsSmooth” should be chosen for the split. (1 mark)
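The gain calculations can be reproduced with a short Python sketch (the `entropy` and `info_gain` helpers are illustrative; the data columns are taken from the fruit table in the question):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(attr_values, labels):
    """GAIN_split = Entropy(parent) - sum_i (n_i/n) * Entropy(child_i)."""
    n = len(labels)
    parent = entropy([labels.count(c) for c in set(labels)])
    children = 0.0
    for v in set(attr_values):
        child = [l for a, l in zip(attr_values, labels) if a == v]
        children += (len(child) / n) * entropy([child.count(c) for c in set(child)])
    return parent - children

# Fruit data from the question, samples A-H.
is_heavy  = [0, 0, 1, 1, 0, 0, 0, 1]
is_smooth = [0, 0, 1, 1, 0, 1, 1, 0]
poisonous = [0, 0, 0, 1, 1, 1, 1, 1]

print(round(info_gain(is_heavy, poisonous), 4))   # → 0.0032
print(round(info_gain(is_smooth, poisonous), 4))  # → 0.0488 (0.0487 in the model answer, which rounds intermediates)
```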

c) Explain early stopping in decision tree construction. What is a potential problem of early stopping? (6 marks)

Model Answer:

Early stopping means that we can stop growing a decision tree when the leaf nodes contain fewer than K items or when no attribute test yields an improvement in class purity. This prevents us from expanding leaf nodes (adding new attribute tests) when the sample sizes are too small to allow us to reliably pick a good test and assign class labels to the resulting leaf nodes. (3 marks)

A potential problem with early stopping is that it is difficult to judge if an attribute split is useful without seeing the splits that might be installed after it. For example, a combination of two attributes could yield good classification, but each of the two attributes in isolation provides no improvement in class purity. (3 marks)


3. a) Consider the data set shown below. Predict the class label for a test sample (A = 0, B = 1, C = 0) using the Naïve Bayes approach. (10 marks)

Record  A  B  C  Class
1       0  0  0  +
2       0  0  1  −
3       0  1  1  −
4       0  1  1  −
5       0  0  1  +
6       1  0  1  +
7       1  0  1  −
8       1  0  1  −
9       1  1  1  +
10      1  0  1  +

Model Answer:

We need to compare P(Class = +|A = 0, B = 1, C = 0) and P(Class = −|A = 0, B = 1, C = 0).

P(Class = +|A = 0, B = 1, C = 0) = P(A = 0, B = 1, C = 0|Class = +) × P(Class = +) / P(A = 0, B = 1, C = 0)

Since the denominator P(A = 0, B = 1, C = 0) is the same for P(Class = +|Attributes) and P(Class = −|Attributes), it can be ignored. (3 marks)

P(A = 0, B = 1, C = 0|Class = +) × P(Class = +)
= P(A = 0|Class = +) × P(B = 1|Class = +) × P(C = 0|Class = +) × P(Class = +)
= 0.4 × 0.2 × 0.2 × 0.5 = 0.008

(3 marks)

Similarly,

P(A = 0, B = 1, C = 0|Class = −) × P(Class = −)
= P(A = 0|Class = −) × P(B = 1|Class = −) × P(C = 0|Class = −) × P(Class = −)
= 0.6 × 0.4 × 0 × 0.5 = 0

(3 marks)

P(Class = +|A = 0, B = 1, C = 0) > P(Class = −|A = 0, B = 1, C = 0); therefore, the value for Class is +. (1 mark)
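The comparison can be sketched in Python (a minimal illustration of the calculation above, with no Laplace smoothing, matching the model answer; `nb_score` is an illustrative name):

```python
# The 10-record table from the question: (A, B, C, Class).
records = [
    (0, 0, 0, "+"), (0, 0, 1, "-"), (0, 1, 1, "-"), (0, 1, 1, "-"),
    (0, 0, 1, "+"), (1, 0, 1, "+"), (1, 0, 1, "-"), (1, 0, 1, "-"),
    (1, 1, 1, "+"), (1, 0, 1, "+"),
]

def nb_score(sample, cls):
    """Numerator P(attributes|class) * P(class); the shared denominator is ignored."""
    rows = [r for r in records if r[3] == cls]
    prior = len(rows) / len(records)
    likelihood = 1.0
    for i, value in enumerate(sample):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return likelihood * prior

sample = (0, 1, 0)
print(round(nb_score(sample, "+"), 3))  # → 0.008
print(nb_score(sample, "-"))            # → 0.0
```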


b) Consider the following cost matrix (rows are actual classes, columns are predicted classes):

Actual \ Predicted   A    B    C
A                    0    50   20
B                    10   0    10
C                    5    30   0

If a classification model predicts A with probability 0.4, B with probability 0.2 and C with probability 0.4, what is the optimal (i.e. lowest cost) decision? Comment on the result. (6 marks)

Model Answer:

In the example, we assume that the predicted probabilities are correct. So classifying the example as A has an expected cost of 10 × 0.2 + 5 × 0.4 = 4. Classifying the example as B has an expected cost of 50 × 0.4 + 30 × 0.4 = 32. Classifying the example as C has an expected cost of 20 × 0.4 + 10 × 0.2 = 10. Thus the lowest-cost decision is to classify the example as class A. (4 marks) In this example, predicting the outcome with the lowest probability gave rise to the largest expected cost. (2 marks)
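The expected-cost decision can be sketched in Python (an illustration of the calculation above; `expected_cost` is an illustrative name):

```python
# cost[actual][predicted], from the cost matrix in the question.
cost = {
    "A": {"A": 0, "B": 50, "C": 20},
    "B": {"A": 10, "B": 0, "C": 10},
    "C": {"A": 5, "B": 30, "C": 0},
}
prob = {"A": 0.4, "B": 0.2, "C": 0.4}  # predicted class probabilities

def expected_cost(predicted):
    """Expected cost of a prediction, averaged over the possible actual classes."""
    return sum(prob[actual] * cost[actual][predicted] for actual in prob)

for p in "ABC":
    print(p, expected_cost(p))  # A 4.0, B 32.0, C 10.0

print(min("ABC", key=expected_cost))  # → A
```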

c) Consider the following algorithm for finding the K nearest neighbours of a data object.

1: for i = 1 to number of data objects do
2:   Find the distance of the ith object to the given object
3: end for
4: Sort these distances in decreasing order
5: Return the objects associated with the first K distances of the sorted list

i) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same. (5 marks)

Model Answer:

There are several problems. First, the order of duplicate objects on a nearest-neighbour list will depend on details of the algorithm and the order of objects in the data set (2 marks). Second, if there are enough duplicates, the nearest-neighbour list may consist only of duplicates (2 marks). Third, an object may not be its own nearest neighbour (1 mark).

ii) How would you fix this problem?

(4 marks)

Model Answer:


There are various approaches depending on the situation. One approach is to keep only one object for each group of duplicate objects (2 marks). In this case, each neighbour can represent either a single object or a group of duplicate objects (2 marks).
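The grouping fix can be sketched in Python. This is an illustration only: the data is 1-D, `knn_with_dedup` is a hypothetical helper name, and it sorts by increasing distance so the nearest representatives come first.

```python
from collections import Counter

def knn_with_dedup(data, query, k):
    """Keep one representative per group of duplicates, then return the k
    nearest representatives along with the size of each duplicate group."""
    groups = Counter(data)  # object -> number of duplicate copies
    # Rank unique objects by distance to the query (absolute distance in 1-D).
    ranked = sorted(groups, key=lambda obj: abs(obj - query))
    return [(obj, groups[obj]) for obj in ranked[:k]]

# Toy 1-D data with duplicates: the value 5 appears three times.
data = [5, 5, 5, 1, 9, 4, 7]
print(knn_with_dedup(data, query=5, k=3))  # → [(5, 3), (4, 1), (7, 1)]
```

Each returned pair is a neighbour together with its group size, so a single neighbour can stand for a whole group of duplicates, as the model answer suggests.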


4. a) Discuss the basic DIFFERENCE between the agglomerative and divisive hierarchical clustering algorithms and mention which type of hierarchical clustering algorithm is more commonly used. (6 marks)

Model Answer:

Agglomerative methods start with each object as an individual cluster and then incrementally build larger clusters by merging clusters (2 marks). Divisive methods, on the other hand, start with all points belonging to one cluster and then split apart a cluster at each iteration (2 marks). The agglomerative method is more common. (2 marks)
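The agglomerative procedure can be sketched in a few lines of Python (an illustration only: toy 1-D points, single-link merging, and an illustrative `agglomerative` helper name; any distance function could be substituted):

```python
def agglomerative(points, target_clusters):
    """Start with singleton clusters; repeatedly merge the closest pair of
    clusters (single linkage: minimum inter-point distance) until the
    target number of clusters remains."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(agglomerative([1, 2, 9, 10, 25], 3))  # → [[1, 2], [9, 10], [25]]
```

A divisive method would run the loop in the other direction, starting from one all-inclusive cluster and choosing a cluster to split at each step.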

b) Consider the data set shown below.

Transaction ID  Items Bought
1               {a, d, e}
2               {a, b, c, e}
3               {a, b, d, e}
4               {a, c, d, e}
5               {b, c, e}
6               {b, d, e}
7               {c, d}
8               {a, b, c}
9               {a, d, e}
10              {a, b, e}

i) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket. (6 marks)

Model Answer:

(2 marks) for each equation.

Support({e}) = 8/10 = 0.8

Support({b, d}) = 2/10 = 0.2

Support({b, d, e}) = 2/10 = 0.2

RS

ii) Use the results in part (i) to compute the confidence for the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure? (5 marks)

Model Answer:

(2 marks) for each equation.

Confidence(bd → e) = 0.2 / 0.2 = 1

Confidence(e → bd) = 0.2 / 0.8 = 0.25

No, confidence is not a symmetric measure. (1 mark)
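The support and confidence calculations can be sketched in Python (an illustration of the definitions above; `support` and `confidence` are illustrative helper names):

```python
# The ten market baskets from the question.
transactions = [
    {"a", "d", "e"}, {"a", "b", "c", "e"}, {"a", "b", "d", "e"},
    {"a", "c", "d", "e"}, {"b", "c", "e"}, {"b", "d", "e"},
    {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"}, {"a", "b", "e"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Confidence(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"e"}))                 # → 0.8
print(support({"b", "d"}))            # → 0.2
print(confidence({"b", "d"}, {"e"}))  # → 1.0
print(confidence({"e"}, {"b", "d"}))  # → 0.25
```

The last two lines make the asymmetry concrete: swapping the two sides of the rule changes the denominator, and hence the confidence.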


c) Compare and contrast model-based and proximity-based approaches for anomaly detection. (8 marks)

Model Answer:

The model-based approach can be used with virtually any underlying technique that fits a model to the data (1 mark). However, note that a particular model, statistical or otherwise, must be assumed. Consequently, model-based approaches are restricted in terms of the data to which they can be applied (2 marks). For example, if the model assumes a Gaussian distribution, then it cannot be applied to data with a non-Gaussian distribution (1 mark). On the other hand, proximity-based approaches do not make any particular assumption about the data (1 mark). They can be used for virtually any type of data, although the proximity metric must be chosen appropriately (2 marks). For example, Euclidean distance is typically used for dense, low-dimensional data, while the cosine similarity measure is used for sparse, high-dimensional data (1 mark).
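To illustrate the proximity-based side, one common score is the distance to an object's k-th nearest neighbour, which needs no distributional assumption. A minimal Python sketch (hypothetical 1-D data; `knn_outlier_score` is an illustrative name):

```python
def knn_outlier_score(data, point, k=2):
    """Proximity-based anomaly score: distance to the k-th nearest neighbour.
    Larger scores mean the point sits further from the rest of the data."""
    dists = sorted(abs(point - other) for other in data if other != point)
    return dists[k - 1]

# Four points near 1.0 and one far away at 10.0.
data = [1.0, 1.1, 0.9, 1.2, 10.0]
scores = {p: knn_outlier_score(data, p) for p in data}
print(max(scores, key=scores.get))  # → 10.0 (the most anomalous point)
```

A model-based alternative would instead fit an assumed distribution (e.g. a Gaussian) and flag points with low likelihood, which is exactly where the distributional restriction discussed above comes in.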


5. A mobile network company receives many emails each day from its customers about fault reporting, bill enquiries, service complaints, upgrade requests, etc. You have been hired as a data mining expert to help build an intelligent email handling software system that has the following functionality: filtering emails based on their priorities (e.g., some emails need to be attended to immediately) and automatically directing the emails to the right person in the customer service department. In this context, answer the following questions:

a) What assumptions do you need to make about this data mining task?

(4 marks)

Model Answer:

Assume the company has a set of pre-defined categories relating to customers’ email enquiries/reports, such as mobile phone fault, network fault, bills and payment, and phone upgrade request. Each category is associated with a certain priority score. For example, emails relating to network faults have the highest priority and should be dealt with immediately. Also, staff in the customer service department have clearly defined responsibilities for email handling. (4 marks)

b) What are the set of features that you will extract from each email? What are the two most important? (6 marks)

Model Answer:

• Features to be extracted from emails include: headers, including the sender’s name and email address, subject, and date/time; the email body content; and signatures, if any, which may include the sender’s address and contact telephone number (3 marks).

• The most important feature is the email body content as it gives the detailed information. The email subject line is also important since it often summarises what the email is about. Other features such as the sender’s name, email, address, contact telephone number, and email date/time are useful in building customer records (3 marks).

c) How would you represent each of the features extracted from emails? What pre-processing steps would you consider? (8 marks)

Model Answer:

• Information extracted from email headers or signatures can be populated into a structured database. Email content can be represented as word vectors, with each element being an index into a vocabulary weighted by word frequency or TF-IDF (term frequency - inverse document frequency) (4 marks).

• For pre-processing, we can perform the following steps:

i) remove punctuation and unwanted text such as HTML tags;

ii) convert all words to lower-case letters;

iii) remove stop words, such as function words or common words like “the”, “a” and “you”;

iv) stemming: convert words into their root form. (4 marks)
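Steps i)-iii) and the TF-IDF weighting can be sketched with the Python standard library (an illustration only: the `STOP_WORDS` set is a tiny illustrative subset, stemming is omitted, and a real system would use an NLP/IR toolkit):

```python
import math
import re

STOP_WORDS = {"the", "a", "you", "is", "to", "my"}  # illustrative subset

def preprocess(text):
    """i) strip HTML tags and punctuation, ii) lower-case, iii) remove stop words."""
    text = re.sub(r"<[^>]+>", " ", text)               # drop HTML tags
    words = re.findall(r"[a-z]+", text.lower())        # lower-case, keep letter runs
    return [w for w in words if w not in STOP_WORDS]   # remove stop words

def tfidf(docs):
    """Weight each term by term frequency x inverse document frequency."""
    tokenised = [preprocess(d) for d in docs]
    n = len(docs)
    df = {}
    for toks in tokenised:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    return [{t: toks.count(t) * math.log(n / df[t]) for t in set(toks)}
            for toks in tokenised]

emails = ["<p>My phone is broken!</p>",
          "Please explain the charge on my bill.",
          "My bill is wrong."]
vectors = tfidf(emails)
print(sorted(vectors[0]))  # → ['broken', 'phone']
```

Note how a term that appears in many emails (“bill”) gets a lower IDF weight than a rarer, more discriminative term (“wrong”).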


d) How would you apply data mining algorithms to this application? State the reasons for your approach. (7 marks) Model Answer:

If we don’t have any historical email records, we can manually label some emails with the pre-defined categories such as fault reporting, bill enquiry, and service complaint. With the initial set of labelled data, we can train a text classifier such as Naïve Bayes or Support Vector Machines to learn from the labelled data. (4 marks) Whenever a new email message arrives, we can then use the classifier to assign it to one of the categories. This would allow emails to be automatically filtered based on their priorities and also directed to the right person in the customer service department (3 marks).

END OF EXAMINATION PAPER
