Title: Introduction to Data Analytics Assignment 3
Author: Jayden Lee
Course: Introduction to Data Analytics
Institution: University of Technology Sydney

Jayden Lee 13558782

Assignment 3: Data Mining in Action. 31250: Introduction to Data Analytics

Contents

Data Mining Problem and Scenario
Data Pre-Processing and Preparation
Data Mining Method
    Binarization
    Classification & Prediction
        Decision Tree
        Gradient Boosted Tree Learner
        Random Forest Learner
        Tree Ensemble Learner
        Multi-Layer Perceptron Learner
        Support Vector Machine
Conclusion
Appendix
    SVM Python Code
    Classification Building and Machine Learning Workflow
References


Data Mining Problem and Scenario: Following on from the previous assignment, this data set contains insurance quote information. The task is to perform data analytics and mining on the data set in order to predict whether each quote was bought. Based on the data set provided, multiple classifiers will be built, optimised and scored to determine which classifier achieves the best results. The report follows the CRISP-DM data analytics methodology.

Data Pre-Processing and Preparation:

Attribute: Quote_ID
Transformation: Dropped
Purpose: Quote_ID is a primary key, so it should have no correlation with the target, and it was therefore dropped.

Attribute: Quote Date
Transformation: Dropped
Purpose: The quote date could influence whether a quote was bought, but the learners and classifiers only see numerical relationships. Exposing this relationship would have required more pre-processing, and I did not believe it was of critical importance, so I preferred to drop the values and lose the complexity.

Attribute: All values
Transformation: Synthetic Minority Over-sampling Technique (SMOTE)
Purpose: The data is unbalanced, with most records belonging to class zero. To balance it, the data can be undersampled or oversampled so that the classes are equalised; I chose to oversample and generate new minority-class values from the existing ones. SMOTE synthesises these values from neighbouring minority samples in the original data set. This can lose data resolution, but in general it improved predictions on the test set. Oversampling was the right choice here, as the company would rather pursue a sale than lose one for lack of follow-up (Richard 2017).

Attribute: Personal_Info5
Transformation: Column dropped
Purpose: More than 80% of the data in this column was missing. The missing data cannot be assumed to equal zero, as it could take any number of values; to preserve the validity of the testing and avoid introducing bias, it was better to drop this column.

Attribute: Property_Info2
Transformation: Column dropped
Purpose: As the values were all zero, the column added nothing to the data set for building and training classifiers and was not worth keeping.

Attribute: Personal_Info1, Property_Info1, Geographic_Info4
Transformation: Rows dropped
Purpose: To ensure significant bias was not introduced, any row containing a missing value for these fields was dropped rather than imputed.

Attribute: Field_Info1, Field_Info4, Coverage_Info3, Sales_Info4, Personal_Info1, Personal_Info3, Property_Info1, Property_Info3, Geographic_Info4, Geographic_Info5
Transformation: Category to number
Purpose: The SVM and neural network learners are unable to deal with categorical or string data types, so it is imperative that all categories are converted to numbers.

Attribute: All values after the category-to-number transformation
Transformation: Normalisation
Purpose: For the data to work with KNIME's neural network node, it must be normalised between zero and one. This is not done for all classifiers because not all learners work as well, or as efficiently, with float values; some prefer integer inputs.
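The two final transformations, category-to-number and min-max normalisation, can be sketched in Python with pandas. The column names here are hypothetical stand-ins for the real attributes; the report itself used KNIME's Category To Number and Normalizer nodes.

```python
import pandas as pd

# Hypothetical stand-in for two columns of the insurance data.
df = pd.DataFrame({
    "Field_Info1": ["B", "C", "B"],  # categorical attribute
    "Coverage_Info3": [2, 5, 3],     # numeric attribute
})

# Category to number: map each distinct label to an integer code,
# mirroring KNIME's Category To Number node.
df["Field_Info1"] = df["Field_Info1"].astype("category").cat.codes

# Min-max normalisation to [0, 1], as required by the MLP node.
df = (df - df.min()) / (df.max() - df.min())

print(df)
```

After this, every column lies in [0, 1], which is what the MLP node expects.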


Data Mining Method: In order to achieve high generalised accuracy, in terms of F-measure for both the zero and one values, data pre-processing must be done. It is critical to filter out rows or data entries that are incorrect or inconsistent so that the rest of the data set can be used for analytics. Given the nature of the problem, determining whether a quote was purchased, the goal is to find and optimise the best classifier. The following methodology was derived, where relevant, from "The 7 Most Important Data Mining Techniques" (Alton 2017).

Binarization: Binarization was not performed, as it was not needed. While some classifiers work better with integer rather than float values, this does not appear to be the case for the classifiers chosen. Another reason binarization was not used in pre-processing is that the data set is incomplete: the column headers do not provide enough information and context to know which attributes would be appropriate to binarize.

Classification & Prediction: In order to generate predictions, classification must be done. Through KNIME and Python's scikit-learn, multiple types of classifiers are available. The following were used: Gradient Boosted Tree Learner, Random Forest Learner, Tree Ensemble Learner, Decision Tree, Support Vector Machine (in scikit-learn) and Multi-Layer Perceptron Learner.

To be trained and tested, these classifiers need the raw data partitioned into training data and test data. This split allows the classifiers to be tested and validated based on the predictions they generate. All partitions in this report use 70% of the data for training and 30% for testing; this split is the industry standard for most data sets and produced the best prediction results in my experimentation.

Validation consists of metrics generated from the confusion matrix, such as accuracy, F-measure or Cohen's kappa. However, these are not always a valid way of judging a classifier's performance: in a skewed data set, F-measure is a sounder way to assess predictions. To counter the bias usually encountered with skewed data, Synthetic Minority Over-sampling Technique was applied to the training data; the testing and validation data were not oversampled. With oversampling, the classifiers produced results on the order of 20% better on all metrics in my experimentation. Subsequently, the best measure of whether a predictor is adequate is a mixture of F-measure and the area under the ROC curve, while Cohen's kappa provides an additional measure of agreement in the predictions (how2stats 2019).

In KNIME, these statistics are derived from the accuracy statistics node, or via scikit-learn's metrics module in Python. Please note that additional ROC curves for the one value are provided in the appendix.
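The workflow described above can be sketched with scikit-learn on synthetic stand-in data. Note two assumptions in this sketch: the data is synthetic, and simple random oversampling is used in place of SMOTE (the report used KNIME's SMOTE node; the imbalanced-learn package offers a SMOTE class in Python). The key point carried over is that only the training split is oversampled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, cohen_kappa_score

# Synthetic stand-in for the skewed insurance data (about 90% class 0).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Industry-standard 70/30 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Oversample the minority class of the TRAINING data only
# (random duplication here; SMOTE would interpolate new points instead).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
pred = clf.predict(X_te)
print("F1 (class 1):", f1_score(y_te, pred))
print("Cohen's kappa:", cohen_kappa_score(y_te, pred))
```

Scoring against the untouched test split keeps the metrics honest about generalisation.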

Decision Tree: A single decision tree is essential to understanding the other 'tree' classifiers and learners. These learners can work on both string and numerical values; however, in my testing, I found better results after converting categories to numbers.

Configuration

Justification: Pruning is enabled because the tree should not be overfit to the data set. In conjunction, a maximum record count of 25,000 produced the best results across all measures and was therefore selected. The minimum records per node was likewise chosen from experimentation, as it produced the best result.

ROC Curve: NULL. The decision tree was unable to deliver a probability calculation and therefore could not provide a ROC curve.

Accuracy Statistics: Shown below for resolution purposes.

Conclusion: The results below demonstrate that this learner performs very strongly for the zero value, with a high F-measure. However, it still underperforms for the one value, with an F-score of 0.655, which makes it only slightly better than random. The classifier should therefore be discarded.
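As a rough scikit-learn equivalent of the KNIME configuration above: this is a sketch on synthetic data, where `min_samples_leaf` stands in for KNIME's minimum records per node and `ccp_alpha` enables cost-complexity pruning (KNIME's pruning method differs, so the mapping is approximate).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# min_samples_leaf ~ KNIME's "min records per node";
# ccp_alpha > 0 turns on cost-complexity pruning against overfitting.
tree = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=0.001,
                              random_state=1)
tree.fit(X_tr, y_tr)
print("Depth:", tree.get_depth())
print("F1:", f1_score(y_te, tree.predict(X_te)))
```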

Gradient Boosted Tree Learner:

Configuration

Justification: To make sure the tree is not overfit, the selected values limit the number of levels (tree depth). While this can cost some raw accuracy, enabling this feature helps ensure the model is not overfit to the training data. Additionally, experimentation with the boosting options showed that the best results came from limiting the number of models to 100.
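The same two controls, capping the number of boosted models at 100 and limiting tree depth, map directly onto scikit-learn's gradient boosting; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# n_estimators=100 caps the number of boosted models; max_depth
# limits tree depth, both guarding against overfitting.
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 random_state=2)
gbt.fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, gbt.predict(X_te)))
```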

ROC Curve: Shown below for resolution purposes.

Accuracy Statistics / Conclusion: Based on the ROC curve for the zero value, the model is highly confident in its predictions. Additionally, with an accuracy of 85%, this classifier does well at predicting most of the data set. However, it still struggles to predict instances of the one value correctly, and with an F-score of 0.65 it is only slightly worse than the decision tree. Moreover, for values predicted as one, the classifier is not confident, with an AUC of 0.19 (shown in the appendix). Like the decision tree, this classifier has struggled to predict the one value and should be discarded.

Random Forest Learner:

Configuration

Justification: Changing the split criterion between Information Gain, Information Gain Ratio and Gini Index, using the same static random seed, showed that Information Gain produced the best result on the F-measure and all accuracy statistics.
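The criterion comparison with a fixed seed can be sketched in scikit-learn on synthetic data. One assumption worth flagging: scikit-learn's `criterion="entropy"` corresponds to information gain, but it offers no gain-ratio option, so only two of KNIME's three criteria appear here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

# Same static seed for each criterion makes the comparison fair.
for criterion in ("entropy", "gini"):
    rf = RandomForestClassifier(criterion=criterion, random_state=42)
    rf.fit(X_tr, y_tr)
    print(criterion, "F1:", f1_score(y_te, rf.predict(X_te)))
```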

ROC Curve

Accuracy Statistics: Shown below for resolution purposes.

Conclusion: The model is strong on the accuracy statistics, yet the ROC's area under the curve tells a mixed story. At 0.7892 for the zero value, the model is reasonably confident in those predictions; for the one value, the result is considerably worse at 0.21 (Narkhede 2018), which by ROC AUC would make those predictions worthless, with a random classifier doing a better job. (There is debate in the scientific community about using ROC to grade classifiers.) While the scorer generated low probabilities, in terms of the accuracy statistics this classifier performed well. Even in this skewed set, the classifier produces strong predictions for both the one and zero values, with F-scores of 0.975 and 0.9 respectively. This result should be kept and evaluated to see if it is the best model produced.

Tree Ensemble Learner:

Configuration

Justification: There are not many settings and configurations you can change in the Tree Ensemble Learner; the only real option is filtering out columns.

ROC Curve

Accuracy Statistics: Shown below for resolution purposes.

Conclusion: This model is very similar to the Random Forest Learner. In terms of the ROC's area under the curve, it is not confident in its predictions for the one value (shown in the appendix). However, for the zero value, at 0.7618 it is more confident than random and within the range of a useful model (Narkhede 2018). As the accuracy statistics show, this classifier performs remarkably well, with F-measures for both the zero and one values above 90%. It is comparable to the Random Forest Learner in the accuracy of its predictions and should be considered when determining the best classifier.

Multi-Layer Perceptron Learner:

Configuration

Justification: From research, three hidden layers appears standard for a simple MLP neural network, and keeping the depth modest helps ensure the model is not overfit (sapo_cosmico 2018). Additionally, six hidden neurons per layer were chosen for the fastest processing time without compromising learning ability (Sarle 2014). Conjecture in the machine learning community holds that a configuration is acceptable if the number of hidden layers and hidden neurons per layer lies between the sizes of the input and output layers (sapo_cosmico 2018). The results generated from this classifier are strong, with 78.4% accuracy, compared with previous results from experimentation with the node's configuration.

ROC Curve: NULL; no probability class was generated.

Accuracy Statistics: Shown below for resolution purposes.

Workflow: Pre-processing was the same as for all the other models, except that all values needed to be normalised for MLP prediction in KNIME. I chose min-max normalisation to ensure the values lay between zero and one, which is optimal for a neural network (Sarle 2014). The values generated by the MLP are returned as decimals and needed to be binarized to 0 and 1; this was performed with a numeric binner, where values of at least 0.5 were binned to 1 and values below 0.5 to 0.
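The configuration and workflow above, three hidden layers of six neurons, min-max scaling, and binning the continuous output at 0.5, can be sketched with scikit-learn on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

# Min-max scale to [0, 1], fitting the scaler on the training data only.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Three hidden layers of six neurons each, as in the report.
mlp = MLPClassifier(hidden_layer_sizes=(6, 6, 6), max_iter=1000,
                    random_state=4)
mlp.fit(X_tr_s, y_tr)

# Continuous scores binned at 0.5, like the numeric binner in KNIME.
scores = mlp.predict_proba(X_te_s)[:, 1]
pred = (scores >= 0.5).astype(int)
print("Accuracy:", accuracy_score(y_te, pred))
```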

Error Plot

Conclusion: From experimenting with configurations of one to three hidden layers and 1 to 26 hidden nodes, this configuration can be considered strong. With additional iterations, it may be possible to escape the error minimum encountered, shown in the error plot (Wilson 2010). While the result is strong for the zero value, with a high F-measure, this classifier fails to perform well on the F-measure for the one value, returning only 0.582. Relative to the other classifiers this is very low performance, and the model should be discarded.

Support Vector Machine: The support vector machine developed for this assignment was built in Python, as I had a hard time getting it to work inside KNIME. The code used for the SVM learner is attached in the appendix; it is built on scikit-learn, with numpy and pandas used for data handling. It is important to note that the data used to train the SVM was exported from KNIME, and the pre-processing, including category-to-number conversion and missing value removal, was performed in KNIME.

Configuration: kernel='rbf', gamma='scale', max_iter=200, tol=0.001

Justification: (How each SVC kernel operates) (Scikit-Project 2019). The results are difficult to justify, as the data set had many dimensions and the model was not generating accurate predictions. The Support Vector Machine's best results came from the RBF kernel, which does not use straight lines to separate classes, as shown in the image above. While RBF took longer to run, with gamma='scale' it generated the best results.

ROC Curve: NULL. I was unable to deliver a probability calculation and therefore could not provide a ROC curve via scikit-learn. While I understand it is possible, at my current level of expertise it proved difficult.

Accuracy Statistics: Accuracy: 0.5229, Precision: 0.5185, Recall: 0.6017, F1 score: 0.5570.

Conclusion: This was the worst classifier generated in terms of the accuracy statistics, and I was unable to generate results close to those produced through KNIME. As it was only slightly better than a random classifier, it should be discarded.
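A self-contained sketch of this configuration on synthetic stand-in data is below. Two additions not in the original code are flagged: the inputs are min-max scaled (SVMs are scale-sensitive), and `probability=True` is passed so that `predict_proba`, and hence a ROC curve, becomes available, addressing the gap noted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

# SVMs are sensitive to feature scale, so normalise to [0, 1] first.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# The report's configuration, plus probability=True (an addition here)
# so a ROC AUC can be computed from predicted probabilities.
svc = SVC(kernel='rbf', gamma='scale', max_iter=200, tol=0.001,
          probability=True, random_state=5)
svc.fit(X_tr_s, y_tr)

pred = svc.predict(X_te_s)
print("Accuracy:", accuracy_score(y_te, pred))
print("F1:", f1_score(y_te, pred))
print("ROC AUC:", roc_auc_score(y_te, svc.predict_proba(X_te_s)[:, 1]))
```

Note that with max_iter=200 the solver may stop before convergence, matching the report's setting rather than best practice.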

Conclusion: A multitude of results were produced from the different classifiers, with strong results generated from almost all of them besides the Python SVM learner. Accuracy was above 75% for everything except the SVM. While accuracy alone is not a good measure, it helps set a baseline on the data set, and it eliminates the SVM learner as a strong model. Next, in terms of F-measure, the trained model should not be biased: the measures for both classes should be comparable. The only classifiers to achieve this were the Random Forest and Tree Ensemble learners, and the only one with F-measures above 90% was the Random Forest learner, which is therefore chosen as the best classifier produced. Additionally, it was the most confident in its predictions, as shown by its ROC AUC for both the zero and one values. However, more training and testing would be needed to validate this conclusion. The random forest learner was able to achieve these high results because it bootstraps both rows and columns to create decision trees from random samples; each decision tree is given a vote, and the majority outcome becomes the prediction. This understanding shows why the random forest learner produced the best result.


Appendix:

SVM Python Code:

#############################################################################
# Author: Jayden Lee ([email protected])
# Date: 6/10/19
# Purpose: To use a SVM for our classification problem.
# Source: In Source Documentation
#############################################################################

import numpy as np
import pandas as pd
import pickle
import time

from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split

#############################################################################
# Input: input
# This function DOES
# Output: input
#############################################################################

def SVMRegression(output, ...

