Assignment 4 - K means cluster, hierarchical cluster Professor Prasad PDF

Course	Data Mining (Undergraduate)
Institution	George Washington University


Description

Abigail Alpert
Data Mining – Spring 2020 – Assignment 4

Question 1: Clustering
Marketing to Frequent Fliers. The file EastWestAirlinesCluster.jmp contains information on 3999 passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.

a. Apply hierarchical clustering and Ward’s method. Make sure to standardize the data. Use the dendrogram and the scree plot, along with practical considerations, to identify the “best” number of clusters. How many clusters would you select? Why?

This is the original clustering, produced with the method described above, along with the scree plot. There are several possible clusterings, and it seems that the more clusters there are, the smaller the error. Looking at the CCC, the smallest change occurs between 2 and 3 clusters, so I chose three clusters.
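The standardization that part (a) calls for can be sketched outside JMP. The following is a minimal pure-Python illustration, not JMP's implementation: it converts each variable to z-scores and compares Euclidean distances before and after. The three passengers and the variable names `balance` and `cc_level` are made-up values for illustration, not records from EastWestAirlinesCluster.jmp.

```python
import math
import statistics

def standardize(column):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical values for three passengers (not taken from the data file).
balance = [20000.0, 60000.0, 100000.0]  # large-scale variable
cc_level = [1.0, 5.0, 1.0]              # small-scale variable

raw = list(zip(balance, cc_level))
scaled = list(zip(standardize(balance), standardize(cc_level)))

# Raw distances are driven almost entirely by balance.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
# After standardizing, both variables contribute on a comparable scale.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw values, passenger 0 looks twice as close to passenger 1 as to passenger 2, purely because of the balance scale; on the standardized values, the difference in the small-scale variable counts as well and the two distances come out equal.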

b. What would happen if the data were not standardized?

If the data were not standardized, then the variables with the largest scales would dominate the distance calculations and skew the clusters. Some of the variables take values below 10, while others take values in the thousands.

c. Explore the clusters to try to characterize them. Try to give each cluster a label.

i. Compare the cluster centroids (select Cluster Summary), and click on the lines to characterize the different clusters.

ii. Cluster Summary

iii. Save the clusters to the data table and use graphical tools and the Column Switcher.

I saved the clusters as seen in the screenshot below:

c. To check the stability of the clusters, hide and exclude a random 5% of the data (you can partition using a random seed of 4279), and repeat the analysis. Does the same picture emerge?

Below are the dendrograms and summary plot of the new analysis. The same picture does not seem to emerge. I used the same number of clusters, but cluster three appears to be very different.

As the saved clusters below show, many of the observations labeled cluster 3 in the second analysis were labeled cluster 1 in the first analysis.

d. Use 𝑘-means clustering with the number of clusters that you found above. How do these results compare to the results from hierarchical clustering? Use the built-in graphical tools to characterize the clusters.

I made three clusters as seen in the table below. These clusters also overlap as seen in the biplot. Cluster 3 (blue) overlaps with both cluster 1 and 2. Cluster 1 and 2 (red and green) do a good job of separating from each other.

Method: K Means Cluster

Cluster   Count
1         1745
2         1548
3         505

Step   Criterion
37     0

NCluster   CCC        Best
3          -12.101    Optimal CCC

Cluster Means

Cluster   Balance       Bonus_miles   Days_since_enroll
1         42873.259     8991.06017    2297.9106
2         60242.2216    11816.3159    5803.78488
3         223047.857    60829.2376    5262.64554

Biplot (principal components PC 1 and PC 2)

Eigenvalues: 1.5943696, 0.8142256, 0.5914048

Scatterplot Matrix
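The k-means step in part d can also be sketched outside JMP. The following is a minimal pure-Python version of Lloyd's algorithm (assign each point to its nearest centroid, then recompute the centroids), offered as an illustration rather than as JMP's implementation. The function name `kmeans`, the points, and the starting centroids are assumptions; the points are made-up, already-standardized values, not the airline data.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: mean of each cluster's members (keep the old centroid if empty).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # centroids stopped moving, so the labels are final
        centroids = new_centroids
    return labels, centroids

# Hypothetical standardized observations in two dimensions.
points = [(-1.0, -1.1), (-0.9, -0.8), (-1.2, -1.0),
          (0.0, 0.1), (0.2, -0.1), (0.1, 0.0),
          (2.0, 2.1), (2.2, 1.9)]
labels, centroids = kmeans(points, [points[0], points[3], points[6]])
print(labels)
print(centroids)
```

Unlike hierarchical clustering, this procedure needs the number of clusters up front and its result depends on the starting centroids, which is one reason the two methods can assign the same passenger to different clusters.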

e. Which clusters would you target for offers, and what types of offers would you target to customers in that cluster?

I would target cluster 3 because its members have the largest mileage balances and the most bonus miles, which they should want to use. They have also been enrolled in the program for a long time, so they may have forgotten that they are members or may not have used it recently.

Question 2: k Nearest Neighbors
The file UniversalBank.jmp contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. Partition the data into training (60%) and validation (40%) sets using a random seed of 4279.

a. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Perform a k-NN classification with all predictors except ID and ZIP code using k = 10. How would this customer be classified? (Note: This analysis may take a few minutes.)

The customer is predicted to be 0 (not accepting the loan).

Below is the analysis of this k-NN model with k = 10.

K Nearest Neighbors report for Personal Loan: mosaic plots for the training and validation sets.
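The classification rule behind this report can be sketched in a few lines. The following is a minimal pure-Python k-NN classifier using Euclidean distance and a majority vote; it is an illustration under stated assumptions, not JMP's implementation. The function name `knn_predict` and the training records are hypothetical: two made-up, already-standardized predictors with a 0/1 loan label, not rows from UniversalBank.jmp.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify query by majority vote among its k nearest training records."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical standardized predictors and loan responses (0 = declined, 1 = accepted).
train_X = [(-1.0, -0.8), (-0.5, -0.6), (-0.2, 0.1), (0.1, -0.3),
           (1.5, 1.2), (1.8, 1.6), (2.0, 0.9)]
train_y = [0, 0, 0, 0, 1, 1, 1]

# An odd k avoids tied votes in a two-class problem.
print(knn_predict(train_X, train_y, (0.0, 0.0), k=3))
print(knn_predict(train_X, train_y, (1.7, 1.3), k=3))
```

Because the vote depends on distances, the predictors should be on comparable scales before the neighbors are found, for the same reason standardization matters in clustering.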

b. What is a choice of 𝑘 that balances between overfitting and ignoring the predictor information?

With a small k, the model follows the local structure so closely that it captures noise in the training data (overfitting). As k increases, the model smooths over local structure and eventually underfits, ignoring the predictor information. A balanced choice is the k with the lowest error rate on the validation data.

c. Show the classification matrix for the validation data that results from using the best 𝑘.

K    Count   Misclassification Rate   Misclassifications
1    2000    0.03700                  74  *
2    2000    0.04500                  90
3    2000    0.04050                  81
4    2000    0.04550                  91
5    2000    0.04850                  97
6    2000    0.05300                  106
7    2000    0.05400                  108
8    2000    0.05200                  104
9    2000    0.05600                  112
10   2000    0.05700                  114
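The selection rule in part b can be applied directly to the table above. This short Python sketch copies the validation misclassification counts from the table, converts them to rates, and picks the k with the lowest rate; the variable names are illustrative.

```python
# Validation-set misclassification counts copied from the table (2000 records each).
validation_count = 2000
misclassified = {1: 74, 2: 90, 3: 81, 4: 91, 5: 97,
                 6: 106, 7: 108, 8: 104, 9: 112, 10: 114}

# Misclassification rate for each k, then the k that minimizes it.
rates = {k: n / validation_count for k, n in misclassified.items()}
best_k = min(rates, key=rates.get)
print(best_k, rates[best_k])
```

The rate does not fall steadily with k here: k = 3 does better than k = 2, which is why the whole range is compared rather than stopping at the first increase.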

The best k is k = 1 because it has the lowest misclassification rate on the validation set.

d. Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Classify the customer using the best k.

Using the best k, the customer is classified as 0, so they are predicted not to accept the loan.

f. Repartition the data, this time into training, validation, and test sets (50%, 30%, 20%). Apply the 𝑘-NN method with the 𝑘 chosen above. Compare the classification matrix of the test set with that of the training and validation sets. Comment on the differences and their reason.

With the new partition, the best k is 3 in the training set, 3 in the validation set, and 1 in the test set. The k-NN platform chose k = 3 as the best model, as shown in the confusion matrix. The training set has the highest misclassification rate; the rate is lower for the validation set and slightly lower again for the test set. Because the model performs at least as well on data it was not trained on, k-NN with k = 3 appears to be a good predictive model.

Question 3: Nave Bayes Classifier Personal Loan Acceptance. The file UniversalBank.jmp contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. In this exercise we focus on two predictors: Online (whether or not the customer is an active user of online banking services) and Credit Card (abbreviated CC below) (does the customer hold a credit card issued by the bank), and the outcome Personal Loan (abbreviated Loan below). 1. 2.

Check to make sure that the variables are coded as Nominal, and that none of the variables has the Value Labels column property (remove this column property if needed). Create a summary of the data using Tabulate, with Online as a column variable, CC as a row variable, and Loan as a secondary row variable. The values inside the cells should convey the count (how many records are in that cell).

                        Personal Loan = Yes     Personal Loan = No
CreditCard   Online     N      Column %         N       Column %
0            0          128    26.67%           1300    28.76%
0            1          209    43.54%           1893    41.88%
1            0          61     12.71%           527     11.66%
1            1          82     17.08%           800     17.70%

3. Consider the task of classifying a customer that owns a bank credit card and is actively using online banking services. Looking at the tabulation, what is the probability that this customer will accept the loan offer? (This is the probability of loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and being an active user of online banking services (Online = 1)).

P(Loan = 1 | CC = 1, Online = 1) = 17.08%, read from the Column % entry for CreditCard = 1, Online = 1 under Personal Loan = Yes.

4. Create two tabular summaries of the data. One will have Loan (rows) as a function of Online (columns) and the other will have Loan (rows) as a function of CC.

                        Personal Loan = Yes     Personal Loan = No
CreditCard   Online     N      Column %         N       Column %
0            0          128    26.67%           1300    28.76%
0            1          209    43.54%           1893    41.88%
1            0          61     12.71%           527     11.66%
1            1          82     17.08%           800     17.70%

5. Compute the following quantities [𝑃 (𝐴|𝐵) means “the probability of A given B”]:

a. 𝑃 (𝐶𝐶 = 1|𝐿𝑜𝑎𝑛 = 1) (the proportion of credit card holders among the loan acceptors) = 12.71% + 17.08% = 29.79%

b. 𝑃 (𝑂𝑛𝑙𝑖𝑛𝑒 = 1|𝐿𝑜𝑎𝑛 = 1) = 43.54% + 17.08% = 60.62%

c. 𝑃 (𝐿𝑜𝑎𝑛 = 1) (the proportion of loan acceptors) = (128 + 209 + 61 + 82) / (128 + 209 + 61 + 82 + 1300 + 1893 + 527 + 800) = 480/5000 = 0.096

d. 𝑃 (𝐶𝐶 = 1|𝐿𝑜𝑎𝑛 = 0) = 11.66% + 17.70% = 29.36%

e. 𝑃 (𝑂𝑛𝑙𝑖𝑛𝑒 = 1|𝐿𝑜𝑎𝑛 = 0) = 41.88% + 17.70% = 59.58%

f. 𝑃 (𝐿𝑜𝑎𝑛 = 0) = 1 - 0.096 = 0.904

g. Use the quantities computed above to compute the naive Bayes probability 𝑃 (𝐿𝑜𝑎𝑛 = 1|𝐶𝐶 = 1, 𝑂𝑛𝑙𝑖𝑛𝑒 = 1).

Proportion of Online = 1 among loan acceptors × proportion of CC = 1 among loan acceptors × proportion of loan acceptors = 0.6062 × 0.2979 × 0.096 = 0.0173

Proportion of Online = 1 among non-acceptors × proportion of CC = 1 among non-acceptors × proportion of non-acceptors = 0.5958 × 0.2936 × 0.904 = 0.1581

𝑃 (𝐿𝑜𝑎𝑛 = 1|𝐶𝐶 = 1, 𝑂𝑛𝑙𝑖𝑛𝑒 = 1) = 0.0173 / (0.0173 + 0.1581) = 0.0986 = 9.86%

h. Compare this value with the one obtained from the tabulation in (b). Which is a more accurate estimate of 𝑃 (𝐿𝑜𝑎𝑛 = 1|𝐶𝐶 = 1, 𝑂𝑛𝑙𝑖𝑛𝑒 = 1)?

The value computed using Naïve Bayes (9.86%) is much lower than the value read from the tabulation (17.08%). Naïve Bayes applies the multiplication rule under the assumption that the predictors are independent within each class, so it is an approximation; an estimate taken directly from the records with CC = 1 and Online = 1 does not rely on that assumption and is the more accurate of the two when the cell has enough records. To check, partition the data on CC = 1 and Online = 1 and look at the distribution of Loan.

i. Which of the entries in this table are needed for computing 𝑃 (𝐿𝑜𝑎𝑛 = 1|𝐶𝐶 = 1, 𝑂𝑛𝑙𝑖𝑛𝑒 = 1)? In JMP, use Naive Bayes to compute the probability that 𝑃 (𝐿𝑜𝑎𝑛 = 1|𝐶𝐶 = 1, 𝑂𝑛𝑙𝑖𝑛𝑒 = 1). Compare this to the number you obtained in (e).

As seen in the prediction profiler, my calculation almost exactly matches the probability computed by the Naïve Bayes platform....
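The hand calculation in part g can be checked from the cell counts in the tabulation. The following Python sketch recomputes the naive Bayes probability without intermediate rounding and, for comparison, the conditional proportion taken directly from the single cell CC = 1, Online = 1. The counts are copied from the table; the dictionary layout and the helper name `conditional` are assumptions made for illustration. Note that the table's Column % values condition on Loan, so 17.08% is the share of loan acceptors who have CC = 1 and Online = 1, whereas the cell-based conditional divides 82 by the 82 + 800 customers in that cell.

```python
# Cell counts from the tabulation: counts[(cc, online)] = (loan_yes, loan_no)
counts = {
    (0, 0): (128, 1300),
    (0, 1): (209, 1893),
    (1, 0): (61, 527),
    (1, 1): (82, 800),
}

n_yes = sum(yes for yes, _ in counts.values())  # loan acceptors
n_no = sum(no for _, no in counts.values())     # non-acceptors
p_yes = n_yes / (n_yes + n_no)
p_no = 1 - p_yes

def conditional(var_index, loan_index, total):
    """P(predictor = 1 | Loan class): sum the cells where that predictor is 1."""
    hits = sum(c[loan_index] for key, c in counts.items() if key[var_index] == 1)
    return hits / total

p_cc_yes = conditional(0, 0, n_yes)      # P(CC = 1 | Loan = 1)
p_online_yes = conditional(1, 0, n_yes)  # P(Online = 1 | Loan = 1)
p_cc_no = conditional(0, 1, n_no)        # P(CC = 1 | Loan = 0)
p_online_no = conditional(1, 1, n_no)    # P(Online = 1 | Loan = 0)

# Naive Bayes: class-conditional probabilities multiplied, then normalized.
numerator = p_cc_yes * p_online_yes * p_yes
denominator = numerator + p_cc_no * p_online_no * p_no
naive_bayes = numerator / denominator

# Proportion of acceptors within the single cell CC = 1, Online = 1.
yes_11, no_11 = counts[(1, 1)]
exact = yes_11 / (yes_11 + no_11)

print(round(naive_bayes, 4), round(exact, 4))
```

Without intermediate rounding, the naive Bayes value comes out at about 0.0988, close to the 0.0986 obtained by hand, and the cell-based proportion 82/882 is about 0.093.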