DATA ANALYSIS - FINAL PAPER

Course: Management Information Systems and Analytics I
Institution: Northern Alberta Institute of Technology


Group 17

CMIS 2250 Data Analysis – Final Project

Explanation of the Process

Goal: To predict which basketball teams would enter the playoffs and how many wins they would get for the 2016-2017 and 2017-2018 seasons, using predictive and classification models.

Collection of Source Data

The source data is collected from the "Miscellaneous Stats" table on the website https://www.basketball-reference.com/leagues/NBA_YYYY.html, where YYYY indicates an NBA season from 2011 to 2016. The data is taken from the table and stored in the Source Data worksheet.

Preparation of Source Data

For each season, the columns for rank, age, PW, PL, MOV, SOS, SRS, Arena, Attendance, and Attendance/G are removed since they are not useful for our goal. Because the table uses the same headers, such as eFG%, TOV%, and FT/FGA, for both offence and defence, those columns are renamed with a prefix: for example OeFG%, where O stands for offence, and DeFG%, where D stands for defence. Lastly, a year-ending column is added so the season each team played in can be easily identified.

Transformation of Source Data

To transform the data into a normalized form for analysis, the data must be cleaned; this was already done before the next step. The playoff information was encoded as a binary column, where 0 means the team did not make the playoffs and 1 means it did. This allows a classification model to be applied to the data.
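For reference, the same preparation and transformation steps could be sketched outside Excel. The snippet below is a minimal pandas sketch, assuming one season's "Miscellaneous Stats" table has been exported to a CSV file; the file name and exact column labels are illustrative assumptions, not taken from the original workbook.

    import pandas as pd

    # Minimal sketch of the preparation/transformation steps described above.
    # File name and column labels are illustrative assumptions.
    df = pd.read_csv("misc_stats_2016.csv")

    # Drop the columns that are not useful for the goal.
    drop_cols = ["Rk", "Age", "PW", "PL", "MOV", "SOS", "SRS",
                 "Arena", "Attend.", "Attend./G"]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # Rename the duplicated offence/defence headers with O/D prefixes
    # (pandas suffixes duplicate CSV headers with ".1" on import).
    df = df.rename(columns={"eFG%": "OeFG%", "TOV%": "OTOV%", "FT/FGA": "OFT/FGA",
                            "eFG%.1": "DeFG%", "TOV%.1": "DTOV%", "FT/FGA.1": "DFT/FGA"})

    # Add the year-ending column so the season is easy to identify.
    df["Year Ending"] = 2016

    # Encode playoff membership as 0/1; basketball-reference typically marks
    # playoff teams with a trailing "*" in the team name.
    df["Playoffs"] = df["Team"].str.endswith("*").astype(int)
    df["Team"] = df["Team"].str.rstrip("*")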

Creating Predictive Models

After transforming the source data into its appropriate (clean) form so that it is ready for modelling, the first step is to partition the data into sets using the Standard Partition utility from the Data Mining tab. This utility splits the variables from the data list and includes them in the partitioned data.


Because the year-ending variable is nominal and has no effect on a team's performance, it should not be given any influence in the model. The data is partitioned using the default settings of 60% for the training set and 40% for the validation set, with rows picked randomly, and the result is stored in a worksheet called STDPartition. From the STDPartition worksheet, three different predictive models (Linear Regression, K-Nearest Neighbors, Regression Tree) are built, with the Wins (W) column as the output variable and the remaining variables, excluding Playoffs, as the selected variables.
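Outside of the Excel workflow, the 60/40 random split and the three model types could be reproduced roughly as sketched below. This assumes the prepared table from the previous sketch is in a DataFrame df with columns W, Playoffs, and Team; default hyperparameters stand in for whatever settings the original models used.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor

    # Predictors: everything except the output (W), the excluded Playoffs flag,
    # and the non-numeric team name. Column names are assumptions.
    X = df.drop(columns=["W", "Playoffs", "Team"])
    y = df["W"]

    # 60% training / 40% validation, rows picked at random
    # (mirrors the Standard Partition defaults described above).
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.6, random_state=42)

    models = {
        "Linear Regression": LinearRegression(),
        "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
        "Regression Tree": DecisionTreeRegressor(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)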

Model Choice

Once the data is partitioned, the validation scores of each model are used to compare their quality, since the validation partition shows how well each model did on new data. The scores are stored in the automatically created LinReg_ValidationScore, KNNP_ValidationScore, and RT_ValidationScore worksheets, and the results are displayed below.

[Validation score summary: Linear Regression]

[Validation score summary: k-Nearest Neighbors]

[Validation score summary: Regression Tree]

Regarding the four error metrics, SSE (Sum of Squared Errors), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and MAD (Mean Absolute Deviation), each measures the error of the model, and the lower the value, the better the model. The R2 (coefficient of determination) metric, on the other hand, indicates how well the selected variables explain the variation in the output, where a higher value indicates a better model. Among the three models, the Regression Tree algorithm clearly performed the worst on the partitioned data, having the highest error values, while the Linear Regression model performed the best, with low error measures and the highest R2. Therefore, it is logical to use the Linear Regression model on the subject data.
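These five metrics can be computed directly from a model's validation predictions. The small sketch below reuses the fitted models and validation split from the earlier sketch (those variable names are assumptions carried over from there).

    import numpy as np

    def validation_scores(y_true, y_pred):
        """SSE, MSE, RMSE, MAD and R-squared, as described above."""
        y_true = np.asarray(y_true, dtype=float)
        residuals = y_true - np.asarray(y_pred, dtype=float)
        sse = float(np.sum(residuals ** 2))
        mse = sse / len(residuals)
        rmse = mse ** 0.5
        mad = float(np.mean(np.abs(residuals)))
        # R^2: share of the output's variation explained by the model.
        ss_total = float(np.sum((y_true - y_true.mean()) ** 2))
        r2 = 1.0 - sse / ss_total
        return {"SSE": sse, "MSE": mse, "RMSE": rmse, "MAD": mad, "R2": r2}

    # Compare the three predictive models on the validation partition.
    for name, model in models.items():
        print(name, validation_scores(y_valid, model.predict(X_valid)))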

Creating Classification Models

A similar process is used to create the classification models. The only difference is that the output variable is now Playoffs, while Wins (W) is excluded from the selected variables. The default probability cut-off of 0.5 (50%) is kept; it is chosen not to minimize or maximize the overall cost of classification error, but to balance it and to achieve the desired sensitivity and specificity.
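A rough equivalent of this classification setup, including the 0.5 probability cut-off, is sketched below. It mirrors the 60/40 split from earlier with Playoffs as the target; the column names and default hyperparameters are again assumptions rather than the workbook's actual settings.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Same predictors as before, but Playoffs is the output and W is excluded.
    Xc = df.drop(columns=["Playoffs", "W", "Team"])
    yc = df["Playoffs"]
    Xc_train, Xc_valid, yc_train, yc_valid = train_test_split(
        Xc, yc, train_size=0.6, random_state=42)

    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
        "Classification Tree": DecisionTreeClassifier(random_state=42),
    }

    cutoff = 0.5  # default probability cut-off described above
    for name, clf in classifiers.items():
        clf.fit(Xc_train, yc_train)
        # Classify as a playoff team when P(Playoffs = 1) meets the cut-off.
        playoff_prob = clf.predict_proba(Xc_valid)[:, 1]
        predicted = (playoff_prob >= cutoff).astype(int)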


Model Choice

For the classification models, three models (Logistic Regression, K-Nearest Neighbors, Classification Tree) are built, and again the validation scores are used for comparison. Below are their classification summaries from the LogReg_ValidationScore, KNNC_ValidationScore, and CT_ValidationScore worksheets.


Comparing the overall accuracy percentages of these models, it is clear that the K-Nearest Neighbors algorithm performed the worst of the three, with the lowest accuracy rate of 71.67%. Using the K-Nearest Neighbors model on the subject data would therefore be unwise, because its lower accuracy could significantly affect the overall correctness and completeness of the output. The Logistic Regression model, with the highest accuracy rate of 83.33%, is the chosen model.


Results

The model stored in LinReg_Stored is used for predictive scoring, while LogReg_Stored is used for classification scoring. The chosen models are scored against the subject data located in the Subject Data worksheet by matching its column names. The worksheets Scoring_LinearRegression and Scoring_LogisticRegression display the resulting predicted number of wins and playoff classifications. The next pages summarize the results in these worksheets.
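The scoring step amounts to applying the two chosen models to the new seasons' rows. A minimal sketch, assuming the subject data has been prepared with the same columns as the training data in a DataFrame named subject_df (the name is illustrative) and reusing the fitted models from the earlier sketches:

    # Score the subject data (2016-17 and 2017-18 seasons) with the chosen models.
    X_subject = subject_df.drop(columns=["W", "Playoffs", "Team"], errors="ignore")

    # Predicted number of wins from the linear regression model.
    subject_df["Predicted W"] = models["Linear Regression"].predict(X_subject)

    # Predicted playoff flag from the logistic regression model,
    # using the same 0.5 probability cut-off as before.
    playoff_prob = classifiers["Logistic Regression"].predict_proba(X_subject)[:, 1]
    subject_df["Predicted Playoffs"] = (playoff_prob >= 0.5).astype(int)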


Results Analysis

Predictive Model

Based on the results, we concluded that the predictive model that best fits the data is Linear Regression, which has the lowest error measures among the three models and the highest coefficient of determination (R2) of 0.87893557. With the use of data partitioning, it is evident that the predicted number of Wins (W) for the subject data is fairly accurate compared to the actual number of wins from the 2016-2017 and 2017-2018 seasons. It is also observed that the Linear Regression model produces the smallest sum of squared residuals. This means that the model's low error metrics (SSE, MSE, RMSE, MAD) correspond to a more accurate and precise prediction of the number of Wins (W) accrued in a season.

Classification Model

The chosen classification model is Logistic Regression because it has the lowest error percentage, 16.67%, compared to the other two models, KNN and Classification Tree, which produced 19.12% and 26.67% respectively. Another factor affecting the choice of model is its sensitivity and specificity. Sensitivity measures the proportion of actual playoff teams that the model correctly identifies, while specificity measures the proportion of actual non-playoff teams that it correctly identifies. Given these factors, the Logistic Regression model has the highest specificity and sensitivity, both at 83.33%, compared with the other two models: K-Nearest Neighbors resulted in 73.2% specificity and 77.2% sensitivity, while the Classification Tree resulted in 75% specificity and 72.22% sensitivity.

Possible Improvements

Now that the results from the different models have been presented, it can be seen that the two models are distinct from each other, which makes it harder for both sets of results to be accurate and reliable. To get more precise results, some modifications should be made to the models being used. For example, both models need improvement in how they group a season's teams, such as a proper division of teams into playoff and non-playoff groups. The classification model in particular fails to sort teams accurately because it does not consider other internal factors; instead, it reaches its conclusion by splitting the teams into loosely defined groups based only on the provided statistics.
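As a concrete illustration of how these classification metrics relate to the confusion matrix, the sketch below computes accuracy, sensitivity, and specificity from validation predictions; it reuses the hypothetical classifiers and validation split from the earlier sketches.

    from sklearn.metrics import confusion_matrix

    def classification_summary(y_true, y_pred):
        """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return {
            "Accuracy": (tp + tn) / (tp + tn + fp + fn),
            # Sensitivity: share of actual playoff teams correctly identified.
            "Sensitivity": tp / (tp + fn),
            # Specificity: share of actual non-playoff teams correctly identified.
            "Specificity": tn / (tn + fp),
        }

    for name, clf in classifiers.items():
        print(name, classification_summary(yc_valid, clf.predict(Xc_valid)))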
