Exam 2019
Course: Data Mining
Institution: University of Computer Studies Yangon

MCQ for Data Mining

Choose the correct answer. (20 marks)

(1) Business intelligence and data warehousing is used for ----------.
(A) Forecasting (B) Data mining (C) All of the above

(2) Internet search engines perform tasks related to the area of ----------.
(A) Information retrieval (B) Information cluster (C) Information visualization

(3) Integration requires a ---------- step that ensures that only valid and useful results are incorporated into the decision support system.
(A) Pruning (B) Postprocessing (C) Preprocessing

(4) Treating incorrect or missing data is called ----------.
(A) Selection (B) Preprocessing (C) Transformation

(5) Various visualization techniques are used in the ---------- step of KDD.
(A) Selection (B) Transformation (C) Interpretation

(6) The two types of predictive modeling tasks are ----------.
(A) Classification and regression (B) Clustering and association (C) Aggregation

(7) Association analysis is used to discover patterns that describe ---------- associated features in the data.
(A) Largely (B) Fewer (C) Strongly

(8) Any subset of a frequent set is a frequent set. This is the ----------.
(A) Upward closure property (B) Downward closure property (C) Maximal frequent set

(9) Which of the following best describes the standard deviation?
(A) The average amount by which scores in a distribution differ from the mean (B) The variance multiplied by the range (C) The mean of the standardized scores

(10) The ---------- step eliminates the extensions of (k-1)-itemsets which are not found to be frequent from being considered for counting support.
(A) Candidate generation (B) Pruning (C) Partitioning

(11) Correlation is:
(A) The covariance of standardized scores (B) The mean of the population standard deviations (C) For comparing mean differences

(12) Classification rules are extracted from ----------.
(A) Root node (B) Decision tree (C) Siblings

(13) ---------- are designed to overcome any limitations placed on the warehouse by the nature of the relational data model.
(A) Operational databases (B) Relational databases (C) Multidimensional databases

(14) A capability of data mining is to build ---------- models.
(A) Retrospective (B) Interrogative (C) Predictive

(15) The full form of KDD is ----------.
(A) Knowledge database (B) Knowledge discovery in databases (C) Knowledge data house

(16) In ----------, the value of an attribute is examined as it varies over time.
(A) Regression (B) Time series analysis (C) Sequence discovery

(17) The problem of the dimensionality curse involves ----------.
(A) The use of some attributes may interfere with the correct completion of a data mining task (B) The use of some attributes may simply increase the overall complexity (C) All of the above

(18) Reducing the number of attributes to solve the high dimensionality problem is called ----------.
(A) Dimensionality curse (B) Dimensionality reduction (C) Cleaning

(19) This unsupervised clustering algorithm terminates when the mean values computed for the current iteration are identical to those computed for the previous iteration.
(A) Agglomerative clustering (B) Conceptual clustering (C) K-means clustering


(20) If a transactional dataset consists of 500,000 transactions, 20,000 transactions contain bread, 30,000 transactions contain jam, and 10,000 transactions contain both bread and jam, then the confidence of buying bread with jam is ----------.
(A) 33.33% (B) 66.66% (C) 50%

1. Answer ANY FIVE of the following: (6 marks each x 5 = 30 marks)

(a) Define any TWO of the following data mining functionalities: association, classification, prediction, characterization. Give examples of each data mining functionality, using a real-life database with which you are familiar.

Answer:
- Association is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like

major(X, "computing science") ----> owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that, of the students under study, 12% (support) major in computing science and own a personal computer, and there is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.

- Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter builds a model to predict some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class labels of data objects, and prediction is typically used for predicting missing numerical data values.

- Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be produced, generating a profile of all the university's first-year computing science students, which may include such information as a high GPA and a large number of courses taken.

(b) Matching Questions. Determine which is the best approach for each problem:
i) supervised learning
ii) unsupervised clustering
iii) data query

1. What is the average weekly salary of all female employees under forty years of age?
2. What attribute similarities group customers holding one or several insurance policies?
3. Do meaningful attribute relationships exist in a database containing information about credit card customers?


4. Do single men play more golf than married men?
5. Determine whether a credit card transaction is valid or fraudulent.

Matching Answers: 1. iii, 2. i, 3. ii, 4. iii, 5. i

(c) Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28, standardize the variable by the following:
(i) Compute the mean absolute deviation of age.
(ii) Compute the z-score for the first four measurements.

Sol.)
(i) The mean absolute deviation of age is 8.8, which is derived as follows:
mean = (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28)/10 = 330/10 = 33
mean absolute deviation s = (|18-33| + |22-33| + |25-33| + |42-33| + |28-33| + |43-33| + |33-33| + |35-33| + |56-33| + |28-33|)/10 = (15 + 11 + 8 + 9 + 5 + 10 + 0 + 2 + 23 + 5)/10 = 88/10 = 8.8

(ii) According to the z-score computation formula, z = (x - mean)/s, the z-scores for the first four measurements are:
z(18) = (18 - 33)/8.8 = -1.70
z(22) = (22 - 33)/8.8 = -1.25
z(25) = (25 - 33)/8.8 = -0.91
z(42) = (42 - 33)/8.8 = 1.02
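As a quick numerical check of (i) and (ii), a minimal Python sketch (variable names are illustrative, not part of the exam):

ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

mean = sum(ages) / len(ages)                        # 33.0
mad = sum(abs(x - mean) for x in ages) / len(ages)  # mean absolute deviation = 8.8

# z-score standardization using the mean absolute deviation as the scale
for x in ages[:4]:
    print(x, round((x - mean) / mad, 2))            # -1.7, -1.25, -0.91, 1.02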

(d) A group of sales prices has been sorted as follows: 4, 8, 15, 21, 21, 24, 25, 28, 34. Partition them into three bins by each of the following methods: i) equal-frequency binning, ii) by bin boundaries, iii) by bin means.

Sol.)
i) Equal-frequency: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
ii) By bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
iii) By bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
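The three smoothing methods can be sketched in a few lines of Python (a minimal illustration, assuming the data is already sorted):

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3

# equal-frequency (equal-depth) partitioning: 3 values per bin
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
print(bins)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# smoothing by bin means: every value becomes its bin's mean
print([[round(sum(b) / len(b))] * len(b) for b in bins])  # 9s, 22s, 29s

# smoothing by bin boundaries: every value moves to the nearer boundary
for b in bins:
    lo, hi = min(b), max(b)
    print([lo if v - lo <= hi - v else hi for v in b])  # [4,4,15], [21,21,24], [25,25,34]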

(e) Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
(i) Time in terms of AM or PM. Binary, qualitative, ordinal
(ii) Brightness as measured by a light meter. Continuous, quantitative, ratio
(iii) Brightness as measured by people's judgments. Discrete, qualitative, ordinal

(iv) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(v) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal

(f) Describe the steps involved in data mining when viewed as a process of knowledge discovery from data (KDD).
The steps are as follows:
- Data cleaning, a process that removes noise and inconsistent data
- Data integration, where multiple data sources may be combined
- Data selection, where data relevant to the analysis task are retrieved from the database
- Data transformation, where data are transformed or consolidated into forms appropriate for mining
- Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
- Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
- Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user

(g) Suppose that the data for analysis includes the attribute age. The age values for the data tuples are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i) What are the mean, median, and mode of the data?
ii) What is the midrange of the data?
iii) Can you find the first quartile (Q1) and the third quartile (Q3) of the data?

Sol.)
i) Mean = 809/27 ≈ 30 and median = 25; the data is bimodal, with modes 25 and 35.
ii) Midrange = (13 + 70)/2 = 41.5
iii) The first quartile is 20 and the third quartile is 35.
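The summary statistics in (g) can be verified with Python's standard library; note that quartile conventions differ between tools, but the default 'exclusive' method below reproduces the exam's values:

import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(statistics.mean(ages))        # 29.96..., i.e. about 30
print(statistics.median(ages))      # 25
print(statistics.multimode(ages))   # [25, 35] -- the data is bimodal
print((min(ages) + max(ages)) / 2)  # midrange = 41.5

q1, _, q3 = statistics.quantiles(ages, n=4)  # exclusive method by default
print(q1, q3)                       # 20.0 35.0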

2. Trace the results of using the Apriori algorithm on the fresh fruit sale shop example with support threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated, highlight the strong ones, and sort them by confidence. (15 marks)

Transaction ID   Items
T1               Peanuts, Lemon, Grape
T2               Peanuts, Lemon
T3               Peanuts, Orange, Apple
T4               Apple, Orange
T5               Apple, Grape
T6               Peanuts, Orange, Apple

Solution: Support threshold = 33.34%, so an itemset is frequent if it appears in at least 2 of the 6 transactions. Applying Apriori:

Pass k = 1
Candidate 1-itemsets (support): Peanuts (4), Lemon (2), Grape (2), Orange (3), Apple (4)
Frequent 1-itemsets: Peanuts, Lemon, Grape, Orange, Apple

Pass k = 2
Candidate 2-itemsets (support): {Peanuts, Lemon} (2), {Peanuts, Grape} (1), {Peanuts, Orange} (2), {Peanuts, Apple} (2), {Lemon, Grape} (1), {Lemon, Orange} (0), {Lemon, Apple} (0), {Grape, Orange} (0), {Grape, Apple} (1), {Orange, Apple} (3)
Frequent 2-itemsets: {Peanuts, Lemon}, {Peanuts, Orange}, {Peanuts, Apple}, {Orange, Apple}

Pass k = 3
Candidate 3-itemsets (support): {Peanuts, Orange, Apple} (2)
Frequent 3-itemsets: {Peanuts, Orange, Apple}

Pass k = 4
Candidate 4-itemsets: {} (none)

Note that {Peanuts, Lemon, Orange} and {Peanuts, Lemon, Apple} are not candidates when k = 3 because their subsets {Lemon, Orange} and {Lemon, Apple} are not frequent. Note also that normally there is no need to go to k = 4, since the longest transaction has only 3 items.

All frequent itemsets: {Peanuts}, {Lemon}, {Grape}, {Orange}, {Apple}, {Peanuts, Lemon}, {Peanuts, Orange}, {Peanuts, Apple}, {Orange, Apple}, {Peanuts, Orange, Apple}.

Association rules:
- {Peanuts, Lemon} generates: Peanuts ----> Lemon (2/6 = 0.33, 2/4 = 0.5) and Lemon ----> Peanuts (2/6 = 0.33, 2/2 = 1)
- {Peanuts, Orange} generates: Peanuts ----> Orange (0.33, 0.5) and Orange ----> Peanuts (2/6 = 0.33, 2/3 = 0.66)
- {Peanuts, Apple} generates: Peanuts ----> Apple (0.33, 0.5) and Apple ----> Peanuts (2/6 = 0.33, 2/4 = 0.5)
- {Orange, Apple} generates: Orange ----> Apple (3/6 = 0.5, 3/3 = 1) and Apple ----> Orange (3/6 = 0.5, 3/4 = 0.75)
- {Peanuts, Orange, Apple} generates: Peanuts ----> Orange ^ Apple (2/6 = 0.33, 2/4 = 0.5), Orange ----> Peanuts ^ Apple (2/6 = 0.33, 2/3 = 0.66), Apple ----> Peanuts ^ Orange (2/6 = 0.33, 2/4 = 0.5), Peanuts ^ Orange ----> Apple (2/6 = 0.33, 2/2 = 1), Peanuts ^ Apple ----> Orange (2/6 = 0.33, 2/2 = 1), and Orange ^ Apple ----> Peanuts (2/6 = 0.33, 2/3 = 0.66)

With the confidence threshold set to 60%, the strong association rules, sorted by confidence, are:
1. Orange ----> Apple (0.5, 1)
2. Lemon ----> Peanuts (0.33, 1)
3. Peanuts ^ Orange ----> Apple (0.33, 1)
4. Peanuts ^ Apple ----> Orange (0.33, 1)
5. Apple ----> Orange (0.5, 0.75)
6. Orange ----> Peanuts (0.33, 0.66)
7. Orange ----> Peanuts ^ Apple (0.33, 0.66)
8. Orange ^ Apple ----> Peanuts (0.33, 0.66)
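A small sketch of how the strong rules from {Peanuts, Orange, Apple} are generated and ranked (the supports are the absolute counts from the Apriori table above):

from itertools import combinations

support = {
    frozenset({"Peanuts"}): 4, frozenset({"Orange"}): 3, frozenset({"Apple"}): 4,
    frozenset({"Peanuts", "Orange"}): 2, frozenset({"Peanuts", "Apple"}): 2,
    frozenset({"Orange", "Apple"}): 3,
    frozenset({"Peanuts", "Orange", "Apple"}): 2,
}

itemset = frozenset({"Peanuts", "Orange", "Apple"})
rules = []
for r in (1, 2):
    for lhs in combinations(sorted(itemset), r):
        lhs = frozenset(lhs)
        confidence = support[itemset] / support[lhs]
        if confidence >= 0.6:  # confidence threshold c = 60%
            rules.append((set(lhs), set(itemset - lhs), confidence))

# prints the four strong rules from this itemset, highest confidence first
for lhs, rhs, confidence in sorted(rules, key=lambda rule: -rule[2]):
    print(lhs, "---->", rhs, round(confidence, 2))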

3. In the fields of science, engineering, and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's true value. Use Naive Bayes classification for the following.

Bike Damaged problem: A data set of ten records is given, with attributes color, type, and origin, and a decision class that can be 'yes' or 'no'. It is necessary to classify an unseen sample X = (Color = Blue, Type = Sports, Origin = Indian), which is not given in the data set. Please show your classification problem solving step by step.

Required formula: P(c | x) = P(x | c) P(c) / P(x), where P(c | x) is the posterior probability, P(c) is the class prior probability, P(x | c) is the likelihood, and P(x) is the prior probability of the predictor. (15 marks)

Solution: The prior probabilities can be computed as P(Yes) = 5/10 and P(No) = 5/10.

For the unseen sample X:
P(X | Yes) · P(Yes) = P(Blue | Yes) · P(Indian | Yes) · P(Sports | Yes) · P(Yes) = 3/5 × 2/5 × 1/5 × 5/10 = 0.024
P(X | No) · P(No) = P(Blue | No) · P(Indian | No) · P(Sports | No) · P(No) = 2/5 × 3/5 × 3/5 × 5/10 = 0.072
Since 0.072 > 0.024, the sample is classified as No.
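A short Python check of the same arithmetic, using the probabilities from the computation above:

# priors: 5 "Yes" and 5 "No" records out of 10
p_yes, p_no = 5 / 10, 5 / 10

# likelihoods for the unseen sample X = (Blue, Sports, Indian)
p_x_given_yes = 3/5 * 2/5 * 1/5  # P(Blue|Yes) * P(Indian|Yes) * P(Sports|Yes)
p_x_given_no = 2/5 * 3/5 * 3/5   # P(Blue|No) * P(Indian|No) * P(Sports|No)

score_yes = p_x_given_yes * p_yes  # 0.024
score_no = p_x_given_no * p_no     # 0.072

# the denominator P(X) is the same for both classes, so it can be ignored
print("Yes" if score_yes > score_no else "No")  # No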

4. (a) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(i) Compute the Euclidean distance between the two objects.
(ii) Compute the Manhattan distance between the two objects.
(iii) Compute the Minkowski distance between the two objects, using p = 3. (10 marks)

Sol.)
(i) Euclidean distance = sqrt((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71

(ii) Manhattan distance = |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11

(iii) Minkowski distance (p = 3) = (|22-20|^3 + |1-0|^3 + |42-36|^3 + |10-8|^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15
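All three distances are instances of the Minkowski distance (Euclidean is p = 2, Manhattan is p = 1), so one small helper covers them:

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

def minkowski(a, b, p):
    # L^p distance between two equal-length numeric tuples
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

print(minkowski(x, y, 2))  # Euclidean: 6.708... ~ 6.71
print(minkowski(x, y, 1))  # Manhattan: 11.0
print(minkowski(x, y, 3))  # Minkowski with p = 3: 6.153... ~ 6.15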

(b) Given the one-dimensional data set {2, 4, 5, 9, 10}, which has been divided into the two clusters {2, 4, 5} and {9, 10}, use single, complete, and average links with the Euclidean distance to calculate the distances between the clusters. (10 marks)

Answer: In order to calculate the distance between two clusters, we need to know the distance between any two objects in different clusters. Therefore, we first compute the pairwise Euclidean distances between the clusters:

        9    10
2       7    8
4       5    6
5       4    5

For the single link, we need the shortest distance:
d({2, 4, 5}, {9, 10}) = min{d(2, 9), d(2, 10), d(4, 9), d(4, 10), d(5, 9), d(5, 10)} = min{7, 8, 5, 6, 4, 5} = 4

For the complete link, we need the longest distance:
d({2, 4, 5}, {9, 10}) = max{d(2, 9), d(2, 10), d(4, 9), d(4, 10), d(5, 9), d(5, 10)} = max{7, 8, 5, 6, 4, 5} = 8

For the average link, we need the average distance:
d({2, 4, 5}, {9, 10}) = (d(2, 9) + d(2, 10) + d(4, 9) + d(4, 10) + d(5, 9) + d(5, 10))/6 = (7 + 8 + 5 + 6 + 4 + 5)/6 = 35/6 ≈ 5.83
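The three linkage values follow directly from the pairwise distances; a minimal Python check:

c1, c2 = [2, 4, 5], [9, 10]

pairwise = [abs(a - b) for a in c1 for b in c2]  # [7, 8, 5, 6, 4, 5]

print(min(pairwise))                  # single link   = 4
print(max(pairwise))                  # complete link = 8
print(sum(pairwise) / len(pairwise))  # average link  = 5.83...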

********************THE END***********************


