THIS DOCUMENT CONTAINS QUESTIONS THAT REPRESENT THE SORT OF QUESTIONS THAT MIGHT APPEAR ON THE FINAL QUIZ FOR DATA MINING FOR BUSINESS ANALYTICS (MANAGERIAL). This document includes answers for some of the questions. THESE ARE INTENDED TO REPRESENT THE FORMAT AND STYLE OF QUESTIONS, NOT NECESSARILY THE CONTENT. The first part contains questions that are specifically associated with particular chapters of the DS for Biz book. The second part then contains questions that span multiple chapters of the DS for Biz book. NB: On the Final Quiz, the questions will not be associated with particular chapters of the book.

Chapters 1 & 2 

data analytic thinking, supervised vs unsupervised, the data mining process

 

2.1 Multiple Choice
In the following, choose the single best answer.
1) (True/False) We can build unsupervised data mining models when we lack labels for the target variable in the training data.
2) (True/False) For supervised data mining, the value of the target variable is known when the model is used.
3) (True/False) Estimating the probability of a fraudulent transaction is an example of data mining.
4) (True/False) Finding the most profitable customer is an example of an unsupervised learning task.
5) (True/False) Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task.
6) (True/False) Choosing which customers are most likely to leave is an example of the use of DM results.

7) (True/False) Discovering patterns of thedefaultsonautoloansis not an example of the model in use.  8) Which is not  a reason why data mining technologiesareattractingsignificant attention nowadays? a) There is too much data for manual analysis b) Data are difficult to transfer from databases c) Data can be a resource for competitive advantage d) Machine learning algorithms are easily available  9) Regression is distinguished from classification by: a) class probability estimation b) numerical attributes c)  numerical target variable d) hypothesis testing   2.2 Short Answer In the following, give brief answers (at most 2 sentences per question).  10)  What  is  a  leak  in  predictive  modeling? Are leaks really a problem? Give a brief example. A leak is a situation where  a  variable  collected in historical data gives information on the  target  variable—information  that appears in historical data but is not actually available when thedecisionhas to bemade. Itwill make us overestimate the predictive performance of our models. An example is predicting whether a customer will be a big spender knowing the categories/numbers of items they have purchased.         


Chapter 3 

feature selection with entropy and information gain, tree induction

 

3.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Induction reasons from general knowledge to specific facts.
2) Entropy
   a) is a measure of information gain
   b) is used to calculate information gain
   c) is a measure of correlation between numeric variables
   d) has a strong odor
3) (True/False) In a classification tree, a non-leaf node is referred to as “decision node” because it allows us to give a class prediction.
4) (True/False) Tree-structured models cannot give us estimates of the probability of customer churn.
5) (True/False) In a classification tree, decision nodes can only ask questions about the attributes of the examples we want to classify.

3.2 Short Answer
In the following, give brief answers (at most 2 sentences per question).
6) What does it mean for one attribute to give information about another attribute? Give an example of how one would find an attribute that gives information about another attribute.
What it means is that the first attribute reduces the uncertainty about the second attribute. Example: old pirate (Book, p. 43-44)
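As an illustration of "reducing uncertainty", the following is a minimal Python sketch of entropy and information gain. The attribute name `has_contract`, the target `churned`, and the toy values are hypothetical and chosen only for illustration; they do not come from the book.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(attribute_values, labels):
    """Reduction in entropy of `labels` after splitting on `attribute_values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of each child group, weighted by the share of examples it holds.
    weighted_child_entropy = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted_child_entropy


# Hypothetical toy data: does "has_contract" tell us anything about "churned"?
has_contract = ["yes", "yes", "yes", "no", "no", "no"]
churned = ["no", "no", "no", "yes", "yes", "yes"]
gain = information_gain(has_contract, churned)
print(gain)
```

In this toy data the attribute splits the examples into two pure groups, so all of the uncertainty about the target is removed. An attribute whose groups have the same class mix as the full set would give a gain of zero.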

 

Chapter 4 

linear discriminants, linear regression, logistic regression, SVMs

 

4.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Support-Vector Machines (SVMs) approach the classification problem by finding the widest possible bar that fits between points of two different classes.
2) Which of the following is not true about logistic regression:
   a) Logistic regression can be used to predict the probability of membership in a certain class.
   b) Logistic regression takes a categorical target variable in training data.
   c) A logistic regression represents the odds of class membership as a linear function of the attributes.
   d) Logistic regression requires numeric attributes and categorical attributes should be converted to numeric attributes.
3) Which of the following does not describe SVM (support vector machine)?
   a) SVMs are based on supervised learning
   b) SVM chooses the line to minimize the margin between two classes
   c) SVM can be applied when the data are not linearly separable

4.2 Short Answer
In the following, give brief answers (at most 2 sentences per question).
4) When we fit a parameterized numeric model to data, we find the optimal model parameters. What does this mean?
By optimal parameters we mean the value of the parameters that best fit the training data. The term “best fit” is used with respect to the objective function of our learning procedure; this translates to minimizing an error/loss/cost function (e.g. minimize the number of misclassified data points, minimize the mean-squared error, minimize the negative log-likelihood).
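To make "best fit with respect to an objective function" concrete, here is a small Python sketch that fits a one-attribute linear model by minimizing the mean-squared error, using the standard closed-form least-squares solution. The function names and the toy data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Slope and intercept that minimize the mean-squared error on (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Objective function: average squared difference between prediction and target."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data (roughly y = 2x + 1 with some noise).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
best = mean_squared_error(xs, ys, slope, intercept)
# Any other parameter setting gives a larger value of the objective.
worse = mean_squared_error(xs, ys, slope + 0.5, intercept)
print(slope, intercept, best, worse)
```

The fitted parameters are "optimal" only relative to this objective and this training set; a different loss function (for example, the negative log-likelihood used by logistic regression) would generally select different parameters.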

4.3 Matching
In the following, choose the best matching for each set; each letter should be used once.

__ Logistic regression        a. numerical target variable not bounded
__ Support Vector Machines    b. decision nodes
__ Linear Regression          c. log odds
__ Classification Trees       d. widest margin

c d a b


Chapter 5 

cv, overfitting

 

5.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Cross-validation is used to estimate generalization performance.
2) (True/False) Adding more complexity to a model will generally increase its performance on the training set.
3) (True/False) Complex models generally give better generalization performance than simple models.
4) A fitting curve plots:
   a) True positive rate vs. false positive rate
   b) True positive rate vs. false negative rate
   c) Generalization performance vs. size of training set
   d) Generalization performance vs. model complexity
5) Which is not a technique for reducing/avoiding overfitting in tree induction?
   a) choose largest improvement in information gain
   b) stop growing tree based on the number of training examples at a leaf
   c) select tree size based on validation data
   d) reduce tree size by cutting off branches and replacing them with leaves
6) Which is not a benefit of using cross-validation for model induction evaluation?
   a) It provides an estimate of generalization performance
   b) It provides statistics on estimated performance, so that we can understand how performance will vary across data sets
   c) It’s quick to compute relative to other holdout methods
   d) It makes better use of limited data by using all data for both training and testing
7) Learning curves
   a) Are used to select an optimal parameter complexity
   b) Are equivalent to fitting curves
   c) Plot true positive rate vs false positive rate
   d) Can illustrate whether obtaining more data would be a good investment
   e) Are shown for a given amount of training data

8) More complex models
   a) have better predictive performance
   b) tend to overfit more
   c) are easier to train than simpler models
   d) are very interpretable

5.2 Short Answer
1) Using a linear model that perfectly separates a set of data points with two labels is not always a good idea. Why is that? Give an example.


Chapter 6 

similarity, neighbours, clusters

 

6.1 Multiple Choice
In the following, choose the single best answer:
1) (True/False) Evaluation is more difficult for unsupervised data mining than supervised data mining.
2) (True/False) When using clustering, a target variable does not have to be precisely defined at training time.
3) (True/False) kNN techniques are computationally efficient in the “use” phase of predictive modeling.
4) (True/False) In the use phase, k-means classifies new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.
5) (True/False) A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model (cf. Chapter 5).
6) Similarity measures are most essential for
   a) Naïve Bayes
   b) Tree Induction
   c) Hierarchical Clustering
   d) Logistic Regression
7) Which is not true of k-Nearest Neighbor (k-NN)?
   a) It can incorporate domain knowledge
   b) It builds a simple induced model
   c) It is robust to noisy data
   d) It is easy to explain how it works


6.2 Short Answer 8) Distance is a key notion  underlying  many data  mining algorithms, such as k-nearest neighbor  (k-NN).  What  problem  is  there  with  comparing consumers using regular Euclidean distance, for example when they are described by age (in years), income (in dollars), and number of credit cards? How can this problem be fixed?  9) Similarity is a key notionunderlyingmanydatamining techniques. If you use Euclidean distance  to  find  similar  examples, how can  you deal with categorical attributes? The k-nearest-neighbor technique estimates thetargetvariablebasedonthe kmost similar examples.  How  exactly  would you  estimate the target variable for a regression problem?  Explain  the pros and cons of using different values for k, for example k=1 and k=N, where N is the total number of training examples. How would you choose k?  10)  Evaluation  for  clustering can be challenging; briefly discuss two different ways to understand the meaning of the clusters found by k-means clustering.  11)  Give  an  example  where clustering can be used to improve business decisions. Explain briefly.    

 


Chapter 7  

7.1 Multiple Choice
1) (True/False) The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.
2) A binary classifier achieves 95% accuracy on a test set consisting of 95% positive and 5% negative instances. If we use the same classifier on a test set composed of 50% positive and 50% negative instances, we expect to get:
   a) higher accuracy
   b) lower accuracy
   c) the same accuracy
   d) cannot be determined

7.2 Short answer
1) Two of your data scientists, A and B, are working on a project for preliminary screening of a population of people for the early detection of Provost’s Quizinoma. Although very rare, this disease is deadly for the person bearing it if not identified in time, so your task is quite important. After preliminary screening, a $750 blood test can determine the presence of the disease with almost perfect accuracy. You decided to motivate your analysts by structuring their work as a competition: both data scientists A and B have to work independently on the problem and then present their results separately. After the competition period is over, on the test data, data scientist A reports 99.9% correctly classified instances from her model, while data scientist B reports only 86.3% correctly classified instances from his model. Describe carefully how you would determine which algorithm is preferable. Illustrate with some hypothetical example numbers.
2) In a classification application we are asked to predict whether kids are going to be infected with the flu virus during 2018 or not, and if so, vaccinate them against it. The vaccine costs $10. If a child is vaccinated, there is only a 10% chance that she will be infected. If a kid gets infected, the cost of treatment is about $1000. Write down the cost-benefit matrix for the problem.






N 1000


  

7.3 Matching
1)
___ accuracy     a. TP/(TP+FP)
___ recall       b. TP/(TP+FN)
___ precision    c. 1 - (FP+FN)/(P+N)

c b a


Chapter 8  

8.1 Multiple Choice
In the following, choose the single best answer:
1) The area under the ROC curve is not:
   a) equal to the Mann-Whitney-Wilcoxon statistic
   b) a measure of the quality of a model’s probability estimates
   c) likely to be at least 0.5
   d) larger when false positive errors cost more
1) (True/False) Adding a budget constraint to the problem formulation might change the choice of the best ranking classifier.
2) (True/False) A profit curve can assume negative values.

1) The points on a model’s ROC curve
   a) represent the performance of different thresholds
   b) represent different rankings of examples
   c) represent the cost of different classifications
2) I want to rank credit applicants by their estimated likelihood of default. Which technique would be least helpful in assessing the quality of a ranking model mined from data? (cf. prior chapters)
   a) holdout testing
   b) calculate area under the ROC curve
   c) calculate percent correctly classified instances
   d) cross-validation
   e) domain knowledge validation

8.2 Short Answer
3) What exactly does the area under the ROC curve represent? Be as precise as possible.
The area under the ROC curve represents the probability that a randomly selected positive example will be ranked above a randomly selected negative example. This is the same as the Mann-Whitney-Wilcoxon statistic.

4) Give a shortexample of using the same model used in different contexts with different thresholds to make different decisions. An examplewould be a model that predicts the GPA a student will achieve. When used used on prospective students, we might want toset some threshold A to extend offers 12

to theonesabovethat threshold. When used on current students, we might want to set a different threshold B to discover the students performing poorly and offer them help (e.g. free tutoring).  5) Give  two  different reasons why using ROC curves  canbe  more effective for assessing model quality than the percent of classifications that are correct (a.k.a. "vanilla" accuracy). 6) Last month your boss sent a mailing to 20,000 of your existing customers with a special offer on a Hoosfoos credeen. The response was exciting: 1% of them responded, which brought in $200,000 in revenue. She has now delegated to you the task of continuing the program, and has given you a budget of $10,000, which will allow you to target another 20,000 customers (out of your customer base of 100,000). You don’t want to just target them randomly, as your boss did. You build a tree model and a logistic regression. Describe how to evaluate them as follows. Describe (a) the confusion matrix and (b) how you will fill it out for one of the models. Describe  (c)  the  cost/benefit matrix for this problem, including the costs and benefits for this case. (d) Show the evaluation function you will use to compare your systems. (e) How do (a) and (c) come into play in this evaluation function?       

   

 13

Chapter 9  

9.1 Multiple Choice
1) You roll a trick 6-sided die twice. The trick is that the die has the same number on all sides. What is the conditional probability that the sum of the numbers that come up on the two rolls will be greater than 7 given that the first roll is 5?
   a) 1/3
   b) 1/6
   c) 2/3
   d) 3/6
   e) 6/6

9.2 Short Answer
2) Explain why Naive Bayes is naive.
3) Explain the meaning of each of the different terms in Bayes Rule. Describe one way that this rule is used for data mining.
Explanation of the different terms in chapter 9 of the book.

 


Chapter 10

10.1 Multiple Choice
1) (True/False) What is considered a stopword depends on the context of the textual data.
2) One key part of the data mining process is creating attributes to describe examples. In order to represent documents (such as emails) as examples, we create term (e.g., word) based attributes to describe the documents. Which of the following is not a common approach?
   a) whether or not the term appears in the document (binary attribute)
   b) term frequency (number of times term appears in document)
   c) term frequency/total number of terms in document
   d) term frequency times the term’s frequency in the document corpus

10.2 Short Answer
3) The word ‘good’ does not always imply positive sentiment in a review. Give an example. Describe a way that we can circumvent this problem.
An example is ‘The movie was not good’. Using 2-grams instead, we can catch ‘not good’ as a different feature than ‘really good’. However, this approach is far from perfect, for example: ‘I cannot understand why people say the movie is not good’.
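The 2-gram idea in the answer above can be sketched in a few lines of Python. The helper name `ngrams` and the simple lowercase, whitespace-based tokenization are assumptions for illustration; a real text-mining pipeline would typically handle punctuation and stopwords as well.

```python
def ngrams(text, n):
    """Word n-grams of a text, using lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


review = "The movie was not good"
unigrams = ngrams(review, 1)
bigrams = ngrams(review, 2)
print(unigrams)
print(bigrams)
```

With single words only, this review contributes the feature "good"; with 2-grams it contributes "not good", which a model can treat separately from "really good".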

 


Chapter 13

Q) You are in an interview where they notice that you've taken a data mining class. (a) They ask you about what you learned there, and besides talking about nitty-gritty modeling stuff, you want to give a bigger picture. Explain why it is important to think about data mining projects strategically, with respect to making internal investments. What sort of investments might you have to make? (b) Now they're interested and ask you if you believe a firm can achieve sustained competitive advantage from data...

