Title | Exam 13 2017, questions and answers |
---|---|
Author | TSU TUNG KU |
Course | Data analysis |
Institution | 元智大學 |
Pages | 30 |
File Size | 491.5 KB |
THIS DOCUMENT CONTAINS QUESTIONS THAT REPRESENT THE SORT OF QUESTIONS THAT MIGHT APPEAR ON THE FINAL QUIZ FOR DATA MINING FOR BUSINESS ANALYTICS (MANAGERIAL). This document includes answers for some of the questions. THESE ARE INTENDED TO REPRESENT THE FORMAT AND STYLE OF QUESTIONS, NOT NECESSARILY THE CONTENT. The first part contains questions that are specifically associated with particular chapters of the DS for Biz book. The second part then contains questions that span multiple chapters of the DS for Biz book. NB: On the Final Quiz, the questions will not be associated with particular chapters of the book.
Chapters 1 & 2
data analytic thinking, supervised vs unsupervised, the data mining process
2.1 Multiple Choice
In the following, choose the single best answer.
1) (True/False) We can build unsupervised data mining models when we lack labels for the target variable in the training data.
2) (True/False) For supervised data mining the value of the target variable is known when the model is used.
3) (True/False) Estimating the probability of a fraudulent transaction is an example of data mining.
4) (True/False) Finding the most profitable customer is an example of an unsupervised learning task.
5) (True/False) Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task.
6) (True/False) Choosing which customers are most likely to leave is an example of the use of DM results.
7) (True/False) Discovering patterns of the defaults on auto loans is not an example of the model in use.
8) Which is not a reason why data mining technologies are attracting significant attention nowadays?
a) There is too much data for manual analysis
b) Data are difficult to transfer from databases
c) Data can be a resource for competitive advantage
d) Machine learning algorithms are easily available
9) Regression is distinguished from classification by:
a) class probability estimation
b) numerical attributes
c) numerical target variable
d) hypothesis testing
2.2 Short Answer
In the following, give brief answers (at most 2 sentences per question).
10) What is a leak in predictive modeling? Are leaks really a problem? Give a brief example.
Answer: A leak is a situation where a variable collected in historical data gives information on the target variable: information that appears in historical data but is not actually available when the decision has to be made. It will make us overestimate the predictive performance of our models. An example is predicting whether a customer will be a big spender knowing the categories/numbers of items they have purchased.
Chapter 3
feature selection with entropy and information gain, tree induction
3.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Induction reasons from general knowledge to specific facts.
2) Entropy
a) is a measure of information gain
b) is used to calculate information gain
c) is a measure of correlation between numeric variables
d) has a strong odor
3) (True/False) In a classification tree, a non-leaf node is referred to as “decision node” because it allows us to give a class prediction.
4) (True/False) Tree-structured models cannot give us estimates of the probability of customer churn.
5) (True/False) In a classification tree, decision nodes can only ask questions about the attributes of the examples we want to classify.
3.2 Short Answer
In the following, give brief answers (at most 2 sentences per question).
6) What does it mean for one attribute to give information about another attribute? Give an example of how one would find an attribute that gives information about another attribute.
Answer: It means that the first attribute reduces the uncertainty about the second attribute. Example: old pirate (Book, p. 43-44).
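The idea that an attribute "reduces the uncertainty" about a target can be made concrete with entropy and information gain. The following is a minimal sketch, not taken from the book: the function names and the toy churn data are hypothetical, and entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Hypothetical toy data: 4 churners and 4 stayers, split on a made-up attribute.
parent = ["churn"] * 4 + ["stay"] * 4
split = [
    ["churn", "churn", "churn", "stay"],  # attribute value A
    ["churn", "stay", "stay", "stay"],    # attribute value B
]

parent_entropy = entropy(parent)        # maximal uncertainty for a 50/50 target
gain = information_gain(parent, split)  # uncertainty removed by the split
print(f"parent entropy = {parent_entropy:.3f} bits, gain = {gain:.3f} bits")
```

A larger gain means the attribute tells us more about the target; tree induction typically picks the split with the largest gain at each decision node.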
Chapter 4
linear discriminants, linear regression, logistic regression, SVMs
4.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Support-Vector Machines (SVMs) approach classification problems by finding the widest possible bar that fits between points of two different classes.
2) Which of the following is not true about logistic regression:
a) Logistic regression can be used to predict the probability of membership in a certain class.
b) Logistic regression takes a categorical target variable in training data.
c) A logistic regression represents the odds of class membership as a linear function of the attributes.
d) Logistic regression requires numeric attributes and categorical attributes should be converted to numeric attributes.
3) Which of the following does not describe SVM (support vector machine)?
a) SVMs are based on supervised learning
b) SVM chooses the line to minimize the margin between two classes
c) SVM can be applied when the data are not linearly separable
4.2 Short Answer
In the following, give brief answers (at most 2 sentences per question).
4) When we fit a parameterized numeric model to data, we find the optimal model parameters. What does this mean?
Answer: By optimal parameters we mean the values of the parameters that best fit the training data. The term “best fit” is used with respect to the objective function of our learning procedure; this translates to minimizing an error/loss/cost function (e.g. minimize the number of misclassified data points, minimize the mean-squared error, minimize the negative log-likelihood).
4.3 Matching
In the following, choose the best matching for each set; each letter should be used once.
__ Logistic regression         a. numerical target variable not bounded
__ Support Vector Machines     b. decision nodes
__ Linear Regression           c. log odds
__ Classification Trees        d. widest margin
c d a b
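The pairing of logistic regression with "log odds" can be illustrated with a short sketch. It assumes the standard formulation in which the log-odds of class membership is a linear function of the attributes; the weights, bias, and attribute values below are hypothetical.

```python
import math

def log_odds(weights, bias, x):
    """Linear function of the attributes: this is the log-odds of the positive class."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def probability(weights, bias, x):
    """Logistic (sigmoid) transform of the log-odds into a class probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(weights, bias, x)))

# Hypothetical fitted parameters and one hypothetical instance.
weights = [0.5, -1.0]
bias = 0.0
x = [4.0, 1.0]

z = log_odds(weights, bias, x)
p = probability(weights, bias, x)
# Converting the probability back to log-odds recovers the linear score.
recovered = math.log(p / (1.0 - p))
print(f"log-odds = {z:.3f}, probability = {p:.3f}, recovered log-odds = {recovered:.3f}")
```

The linear score is unbounded, while the probability always lies strictly between 0 and 1.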
Chapter 5
cross-validation, overfitting
5.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Cross-validation is used to estimate generalization performance.
2) (True/False) Adding more complexity to a model will generally increase its performance on the training set.
3) (True/False) Complex models generally give better generalization performance than simple models.
4) A fitting curve plots:
a) True positive rate vs. false positive rate
b) True positive rate vs. false negative rate
c) Generalization performance vs. size of training set
d) Generalization performance vs. model complexity
5) Which is not a technique for reducing/avoiding overfitting in tree induction?
a) choose largest improvement in information gain
b) stop growing tree based on the number of training examples at a leaf
c) select tree size based on validation data
d) reduce tree size by cutting off branches and replacing them with leaves
6) Which is not a benefit of using cross-validation for model induction evaluation?
a) It provides an estimate of generalization performance
b) It provides statistics on estimated performance, so that we can understand how performance will vary across data sets
c) It’s quick to compute relative to other holdout methods
d) It makes better use of limited data by using all data for both training and testing
7) Learning curves
a) Are used to select an optimal parameter complexity
b) Are equivalent to fitting curves
c) Plot true positive rate vs false positive rate
d) Can illustrate whether obtaining more data would be a good investment
e) Are shown for a given amount of training data
8) More complex models
a) have better predictive performance
b) tend to overfit more
c) are easier to train than simpler models
d) are very interpretable
5.2 Short Answer
1) Using a linear model that perfectly separates a set of data points with two labels is not always a good idea. Why is that? Give an example.
Chapter 6
similarity, neighbours, clusters
6.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Evaluation is more difficult for unsupervised data mining than supervised data mining.
2) (True/False) When using clustering a target variable does not have to be precisely defined at training time.
3) (True/False) kNN techniques are computationally efficient in the “use” phase of predictive modeling.
4) (True/False) In the use phase, k-means classifies new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.
5) (True/False) A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model (cf. Chapter 5).
6) Similarity measures are most essential for
a) Naïve Bayes
b) Tree Induction
c) Hierarchical Clustering
d) Logistic Regression
7) Which is not true of k-Nearest Neighbor (k-NN)?
a) It can incorporate domain knowledge
b) It builds a simple induced model
c) It is robust to noisy data
d) It is easy to explain how it works
6.2 Short Answer
8) Distance is a key notion underlying many data mining algorithms, such as k-nearest neighbor (k-NN). What problem is there with comparing consumers using regular Euclidean distance, for example when they are described by age (in years), income (in dollars), and number of credit cards? How can this problem be fixed?
9) Similarity is a key notion underlying many data mining techniques. If you use Euclidean distance to find similar examples, how can you deal with categorical attributes? The k-nearest-neighbor technique estimates the target variable based on the k most similar examples. How exactly would you estimate the target variable for a regression problem? Explain the pros and cons of using different values for k, for example k=1 and k=N, where N is the total number of training examples. How would you choose k?
10) Evaluation for clustering can be challenging; briefly discuss two different ways to understand the meaning of the clusters found by k-means clustering.
11) Give an example where clustering can be used to improve business decisions. Explain briefly.
Chapter 7
7.1 Multiple Choice
1) (True/False) The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.
2) A binary classifier achieves 95% accuracy on a test set consisting of 95% positive and 5% negative instances. If we use the same classifier on a test set composed of 50% positive and 50% negative instances, we expect to get:
a) higher accuracy
b) lower accuracy
c) the same accuracy
d) cannot be determined
7.2 Short Answer
1) Two of your data scientists, A and B, are working on a project for preliminary screening of a population of people for the early detection of Provost’s Quizinoma. Although very rare, this disease is deadly for the person bearing it if not identified in time, so your task is quite important. After preliminary screening, a $750 blood test can determine the presence of the disease with almost perfect accuracy. You decided to motivate your analysts by structuring their work as a competition: both data scientists A and B have to work independently on the problem and then present their results separately. After the competition period is over, on the test data, data scientist A reports 99.9% correctly classified instances from her model, while data scientist B reports only 86.3% correctly classified instances from his model. Describe carefully how you would determine which algorithm is preferable. Illustrate with some hypothetical example numbers.
2) In a classification application we are asked to predict whether kids are going to be infected with the flu virus during 2018 or not, and if yes vaccinate them against it. The vaccine costs $10. If a child is vaccinated, there is only a 10% chance that she will be infected. If a kid gets infected, the cost of treatment is about $1000. Write down the cost-benefit matrix for the problem.
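Answer (costs in dollars; rows are the predicted class, columns the actual class):
                Actual P    Actual N
Predicted P     110         10
Predicted N     1000        0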
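The entries of the matrix above follow from the numbers in the question. The sketch below assumes that rows are the predicted class (P = vaccinate) and columns the actual class, and that the 110 entry is the expected cost of a vaccinated child who would otherwise have been infected (10 for the vaccine plus a 10% chance of 1000 in treatment). The joint probabilities used at the end are hypothetical, chosen only to show how the matrix is used.

```python
# Figures from the question.
VACCINE_COST = 10
TREATMENT_COST = 1000
P_INFECTED_IF_VACCINATED = 0.10

# Keys are (predicted, actual); the row/column layout is an assumption.
cost = {
    ("P", "P"): VACCINE_COST + P_INFECTED_IF_VACCINATED * TREATMENT_COST,
    ("P", "N"): VACCINE_COST,
    ("N", "P"): TREATMENT_COST,
    ("N", "N"): 0,
}

def expected_cost(joint_probabilities):
    """Expected cost per child: sum over cells of P(cell) * cost(cell)."""
    return sum(p * cost[cell] for cell, p in joint_probabilities.items())

# Hypothetical joint probabilities of (predicted, actual) for some classifier.
hypothetical_rates = {
    ("P", "P"): 0.05,
    ("P", "N"): 0.15,
    ("N", "P"): 0.02,
    ("N", "N"): 0.78,
}

cost_per_child = expected_cost(hypothetical_rates)
print(f"expected cost per child: ${cost_per_child:.2f}")
```

Comparing this expected cost across candidate classifiers is one way to choose between them when errors have unequal costs.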
7.3 Matching
1)
___ accuracy      a. TP/(TP+FP)
___ recall        b. TP/(TP+FN)
___ precision     c. 1 - (FP+FN)/(P+N)
c b a
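The three formulas in the matching exercise can be checked on a small confusion matrix. This is a minimal sketch with hypothetical counts; it takes P + N to be the total number of instances, so that P + N = TP + FN + TN + FP.

```python
def accuracy(tp, fp, tn, fn):
    """1 - (FP + FN) / (P + N), where P + N is the total number of instances."""
    return 1 - (fp + fn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """TP / (TP + FN): fraction of actual positives that were found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """TP / (TP + FP): fraction of predicted positives that were correct."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5

acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
prec = precision(tp, fp)
print(f"accuracy = {acc:.3f}, recall = {rec:.3f}, precision = {prec:.3f}")
```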
Chapter 8
8.1 Multiple Choice
In the following, choose the single best answer:
1) The area under the ROC curve is not:
a) equal to the Mann-Whitney-Wilcoxon statistic
b) a measure of the quality of a model’s probability estimates
c) likely to be at least 0.5
d) larger when false positive errors cost more
1) (True/False) Adding a budget constraint to the problem formulation might change the choice of the best ranking classifier.
2) (True/False) A profit curve can assume negative values.
1) The points on a model’s ROC curve
a) represent the performance of different thresholds
b) represent different rankings of examples
c) represent the cost of different classifications
2) I want to rank credit applicants by their estimated likelihood of default. Which technique would be least helpful in assessing the quality of a ranking model mined from data? (cf. prior chapters)
a) holdout testing
b) calculate area under the ROC curve
c) calculate percent correctly classified instances
d) cross-validation
e) domain knowledge validation
8.2 Short Answer
3) What exactly does the area under the ROC curve represent? Be as precise as possible.
Answer: The area under the ROC curve represents the probability that a randomly selected positive example will be ranked above a randomly selected negative example. This is the same as the Mann-Whitney-Wilcoxon statistic.
4) Give a short example of using the same model in different contexts with different thresholds to make different decisions.
Answer: An example would be a model that predicts the GPA a student will achieve. When used on prospective students, we might want to set some threshold A to extend offers to the ones above that threshold. When used on current students, we might want to set a different threshold B to discover the students performing poorly and offer them help (e.g. free tutoring).
5) Give two different reasons why using ROC curves can be more effective for assessing model quality than the percent of classifications that are correct (a.k.a. "vanilla" accuracy).
6) Last month your boss sent a mailing to 20,000 of your existing customers with a special offer on a Hoosfoos credeen. The response was exciting: 1% of them responded, which brought in $200,000 in revenue. She has now delegated to you the task of continuing the program, and has given you a budget of $10,000, which will allow you to target another 20,000 customers (out of your customer base of 100,000). You don’t want to just target them randomly, as your boss did. You build a tree model and a logistic regression. Describe how to evaluate them as follows. Describe (a) the confusion matrix and (b) how you will fill it out for one of the models. Describe (c) the cost/benefit matrix for this problem, including the costs and benefits for this case. (d) Show the evaluation function you will use to compare your systems. (e) How do (a) and (c) come into play in this evaluation function?
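The ranking interpretation of the area under the ROC curve given in the answer to question 3 can be computed directly by comparing every positive example with every negative example. The sketch below is illustrative only: the scores and labels are hypothetical, and counting a tie as half a win is a common convention assumed here.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a win (an assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical model scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_pairs(scores, labels)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of the scores matters, this quantity does not depend on any particular classification threshold.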
Chapter 9
9.1 Multiple Choice
1) You roll a trick 6-sided die twice. The trick is that the die has the same number on all sides. What is the conditional probability that the sum of the numbers that come up on the two rolls will be greater than 7 given that the first roll is 5?
a) 1/3
b) 1/6
c) 2/3
d) 3/6
e) 6/6
9.2 Short Answer
2) Explain why Naive Bayes is naive.
3) Explain the meaning of each of the different terms in Bayes Rule. Describe one way that this rule is used for data mining.
Answer: Explanation of the different terms in chapter 9 of the book.
Chapter 10
10.1 Multiple Choice
1) (True/False) What is considered a stopword depends on the context of the textual data.
2) One key part of the data mining process is creating attributes to describe examples. In order to represent documents (such as emails) as examples, we create term (e.g., word) based attributes to describe the documents. Which of the following is not a common approach?
a) whether or not the term appears in the document (binary attribute)
b) term frequency (number of times term appears in document)
c) term frequency/total number of terms in document
d) term frequency times the term’s frequency in the document corpus
10.2 Short Answer
3) The word ‘good’ does not always imply positive sentiment in a review. Give an example. Describe a way that we can circumvent this problem.
Answer: An example is ‘The movie was not good’. Using 2-grams instead, we can catch ‘not good’ as a different feature than ‘really good’. However, this approach is far from perfect, for example: ‘I can not understand why people say the movie is not good’.
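The 2-gram idea in the answer above can be sketched in a few lines. This is a simplified illustration, not a full text-processing pipeline: the `ngrams` helper is hypothetical and tokenizes by lowercasing and splitting on whitespace only.

```python
def ngrams(text, n):
    """Return the word n-grams of a text, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

negative_review = "The movie was not good"
tricky_review = "I can not understand why people say the movie is not good"

negative_bigrams = ngrams(negative_review, 2)
tricky_bigrams = ngrams(tricky_review, 2)

# The bigram 'not good' separates this review from one containing 'really good'.
print("not good" in negative_bigrams)
# The same bigram also appears in the second review, whose overall sentiment is positive.
print("not good" in tricky_bigrams)
```

Both reviews produce the feature 'not good', which is exactly why the answer calls the approach far from perfect.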
Chapter 13
Q) You are on an interview where they notice that you've taken a data mining class. (a) They ask you about what you learned there, and besides talking about nitty-gritty modeling stuff, you want to give a bigger picture. Explain why it is important to think about data mining projects strategically, with respect to making internal investments. What sort of investments might you have to make? (b) Now they're interested and ask you if you believe a firm can achieve sustained competitive advantage from data...