EBI Prediction of Medical Insurance Cost Group 3 PDF

Title	EBI Prediction of Medical Insurance Cost Group 3
Author	pruthviraj morbale
Course	Business Intelligence
Institution	Lamar University
Pages	27
File Size	1.3 MB
File Type	PDF

Summary

Project report...

Description

LAMAR UNIVERSITY

Prediction of Medical Insurance Cost

Enterprise Business Intelligence 5382, Group 3, Spring 2019
Guided by Dr. Yueqing Li

Lokesh Kolhe - L20443237
Vikrant Singh - L20451075
Pruthviraj Morbale - L20478184

Introduction: Data mining works on large data sets and processes them into a simple form that can be readily understood. Based on the result, one can analyze, classify, or predict future outcomes. R programming is a basic component of the data mining techniques used here. Medical insurance is a primary concern for people today. It holds great importance in day-to-day life, yet its cost is not affordable for everyone, so many people go without basic healthcare insurance. Medical checkups and treatments also become complicated for an insurance company when patients have a history of illness or other medical issues. No standard set of basic attributes is defined for calculating insurance cost. Private and government insurers apply different criteria, which also depend on the region and the individual.

Data mining plays a crucial role in this analysis. Data mining algorithms are now applied to patient data, habits, and illness information to predict patient health outcomes. Several attempts have been made to predict disease accurately, but without data mining algorithms the results are not reliable. Many medical insurance companies charge people the same cost, and the primary criterion they use to set it is the customer's age. Other attributes also play a vital role in predicting someone's health or insurance cost. Some people have more than two dependents and some have fewer, so the insurance cost should vary accordingly. People can also be divided into smokers and non-smokers: smoking affects a smoker's health more than that of others, and it is a liability both to their health and to the insurance company. A patient with a higher body mass index has a greater chance of becoming ill than a healthy person. If everyone's health condition is different, medical insurance cost should vary too. Region is another basic variable included in this study to predict the outcome.

This research studies a mutually beneficial path and provides an outcome that helps patients as well as insurance companies: it helps patients submit their claims and makes it easier for the insurance company to process them. Data mining techniques are therefore applied to patients' basic healthcare data to predict the insurance cost per individual. Many data mining techniques have been used so far, but multiple linear regression has not been used to predict these results. In this research we therefore implement multiple linear regression (MLR) to predict insurance cost.

Objective:   

Create a better model to predict the cost of health insurance using multiple linear regression, and determine how that cost varies with the variables in the data set.

Literature Review: Thorough research has been carried out on data mining applications for prediction in health care systems. Modern data mining tools have been applied to large population data sets to check whether data mining methods can accurately predict patients' health and medical insurance costs. That research showed that cleaned historical data yields significant forecasting results. Predictions are most accurate for the highest-cost patients and their primary cost information, but accuracy may suffer when cost deviates [1]. Logistic regression and decision tree techniques have been used to demonstrate that public health outcomes and insurance policy information for hypertension can be predicted; the results and performance of the techniques were compared with each other to check accuracy [2]. A large number of US individuals are medically uninsured and face hazardous consequences. A study using logistic regression found that the number of uninsured individuals has been increasing and that they receive less medical care and poorer treatment than insured individuals [3]. A new methodology for predicting disease and medical cost has been studied with promising results. It estimates the number of people affected by lifestyle-related diseases using SeqLDA, which classifies hierarchical sequential data into segments; SeqLDA produced accurate predictions from time-series healthcare data [4]. To analyze disparity in health insurance, a data mining approach was taken to identify the factors that lead to this phenomenon. Artificial neural network and decision tree models were developed to analyze the importance of predictive factors. Income, employment status, education, and marital status were the crucial predictive factors recorded in the study. With those results, people were classified into those who have healthcare coverage and those who do not [5]. Decreasing ozone layer concentration affects human health and sea level and increases global warming. To give early warning of ozone layer effects, a multiple linear regression model was implemented to predict hourly changes in ozone concentration [6]. Very little research has been conducted in this area, even though medical insurance is a vital part of public health. From the existing research and public health data, however, we can say that data mining algorithms can provide optimal results for predicting public health.

Data Description:    

Census data on public health was taken from a public source (https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv). The data set covers 1,338 adults aged 18 to 64 years, 662 female and 676 male, with the number of dependents/children ranging from 0 to 5. Cost of medical insurance is the dependent variable; age, sex, BMI, children, smoking habit, and region are the six independent variables.
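Three of the independent variables (sex, smoking habit, region) are categorical, so they have to be turned into numbers before they can enter a regression. The sketch below shows one common way to do this, indicator (dummy) columns with the first level as the reference. It is written in Python purely for illustration rather than in the R used for the project; the records are invented, and the level names and the helper names `feature_names` and `encode` are assumptions, not taken from the report.

```python
# Hypothetical records in the shape described above; the values are invented.
records = [
    {"age": 23, "sex": "male", "bmi": 31.4, "children": 1,
     "smoker": "yes", "region": "southeast"},
    {"age": 52, "sex": "female", "bmi": 24.8, "children": 3,
     "smoker": "no", "region": "northeast"},
]

NUMERIC = ["age", "bmi", "children"]

# Assumed level names. The first level of each variable is the reference
# level and gets no column of its own.
CATEGORICAL = {
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
}


def feature_names():
    """Column names of the encoded feature row, in order."""
    names = list(NUMERIC)
    for var, levels in CATEGORICAL.items():
        names += [f"{var}_{level}" for level in levels[1:]]
    return names


def encode(record):
    """Turn one record into a numeric row with 0/1 indicator columns."""
    row = [float(record[name]) for name in NUMERIC]
    for var, levels in CATEGORICAL.items():
        row += [1.0 if record[var] == level else 0.0 for level in levels[1:]]
    return row


print(feature_names())
for record in records:
    print(encode(record))
```

With this encoding the six independent variables become eight numeric columns, because region contributes three indicator columns.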

Methodology: Multiple linear regression is used to build the prediction model. In multiple linear regression, the dependent variable is determined from the independent variables, their coefficients, and an intercept as follows:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Here Y is the dependent variable, which in this project is the cost of health insurance, and X1, X2, ..., Xk are the independent variables, which in this project are age, sex, bmi, children, smoker, and region. β0 is the intercept, β1, β2, ..., βk are the regression coefficients, and ε is the error term. The data is split randomly into training and validation sets in a 70:30 ratio. The split is made on cost, since it is the dependent variable: 70% of the records are used for training and the remaining 30% for validation.
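To make the two steps of the methodology concrete, the following is a minimal sketch of a random 70:30 split followed by a least-squares fit of the regression equation. It is written in Python for illustration, not in the R used for the project, and it runs on synthetic data with made-up coefficients rather than on the insurance data set. The split is a plain random shuffle, which is an assumption; the report does not say exactly how the split on cost was carried out. The helper names `solve`, `fit_ols`, and `predict` are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Least-squares estimates [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in X]  # leading 1 carries the intercept b0
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, row):
    """Evaluate b0 + b1*x1 + ... + bk*xk for one row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic data: columns are age, bmi, smoker (0/1). The generating
# coefficients below are invented for illustration only.
rng = random.Random(42)
X = [[rng.uniform(18, 64), rng.uniform(16, 50), float(rng.randint(0, 1))]
     for _ in range(200)]
y = [2000 + 250 * age + 300 * bmi + 20000 * smoker + rng.gauss(0, 500)
     for age, bmi, smoker in X]

# Random 70:30 split into training and validation indices.
idx = list(range(len(X)))
rng.shuffle(idx)
cut = len(idx) * 7 // 10
train, valid = idx[:cut], idx[cut:]

beta = fit_ols([X[i] for i in train], [y[i] for i in train])
valid_rmse = (sum((y[i] - predict(beta, X[i])) ** 2 for i in valid) / len(valid)) ** 0.5

print("estimated coefficients:", [round(b, 1) for b in beta])
print("validation RMSE:", round(valid_rmse, 1))
```

The estimated coefficients should land close to the values used to generate the data, and the validation error gives a check on rows the fit never saw.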

Model Building:

Basic Model: The data set has six independent variables and one dependent variable. The basic model is created first, using the 70:30 training and validation split on cost. In this first model, the cost of insurance is regressed on all the independent variables. The accuracy (R-squared) was 74.82%, and sex and region were not statistically significant variables in this regression.

Improved Model: To get the best model for this data set, we tried various models in search of higher accuracy. After removing the insignificant variables and interactions and running model30, the results were more accurate: the accuracy increased to 84.2%. To check the correlation of the independent variables with the dependent variable, box plots were drawn with ggplot() and boxplot(). Based on the better correlation, prediction plots of age, sex, region, children, and bmi were plotted.

Results:

Model      P-Value      R-Squared Value   RMSE Value   RSE
Basic      < 2.2e-16    0.75              3.458        6062
Improved   < 2.2e-16    0.842             3.004        4890
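For readers unfamiliar with the columns above, the sketch below computes R-squared, RMSE, and RSE (residual standard error) from a list of actual and predicted values, assuming the standard textbook definitions. The report does not state how or on which part of the data its values were calculated, so this is only an illustration; it is written in Python rather than R, the numbers are invented, and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(actual, predicted, n_predictors):
    """R-squared, RMSE and residual standard error under their usual definitions."""
    n = len(actual)
    mean_y = sum(actual) / n
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    tss = sum((a - mean_y) ** 2 for a in actual)                # total sum of squares
    return {
        "r_squared": 1 - rss / tss,
        "rmse": math.sqrt(rss / n),
        # RSE divides by the residual degrees of freedom, n - p - 1.
        "rse": math.sqrt(rss / (n - n_predictors - 1)),
    }


# Invented values, one predictor.
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted = [11.0, 12.0, 14.0, 20.0, 23.0]

metrics = regression_metrics(actual, predicted, n_predictors=1)
for name, value in metrics.items():
    print(name, round(value, 4))
```

RMSE and RSE differ only in the divisor, so for the same residuals RSE is always slightly larger, and the gap shrinks as the number of observations grows.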

For basic model

For improved model

Prediction Plots:

Discussion: After processing the data, a model was built using multiple linear regression; given an individual's basic health details as input, it returns that individual's predicted insurance cost. On this data set we obtained promising results. If data on more variables becomes available in future, such as income, drinking, marital status, diabetes, profession, and past major illness, we expect more accurate predictions of an individual's medical insurance cost. With such data, we could also predict an individual's future health.

Conclusion: 

   

After comparing both models, the R-squared value is 74.82% for the basic model (model1) and 84.2% for the improved model (model30), so the improved model provides the more accurate results. This shows that multiple linear regression is a powerful tool for predicting health insurance cost. From the ggplot() and boxplot() plots, we conclude that the variable smoker is more strongly correlated with cost than the other variables. For non-smokers, the health insurance cost increases with age; for smokers, the plot shows no such dependency. The scatter plot of cost vs. bmi shows that individuals with a BMI of 30 have a higher insurance cost. The algorithm we developed, based on the modern data mining technique of multiple linear regression, provides the most accurate results for this data set; other data mining techniques may give different results in future.

Reference:
1) Bertsimas, D., Kane, M.A., Bjarnadottir, M.V., Kryder, J., Pandey, R., Vempala, S., 2008, Algorithmic prediction of healthcare costs, Operations Research, Vol. 56, No. 6, 1382-1392.
2) Chae, Y.M., Ho, S.H., Chao, K.W., Lee, D.H., 2001, Data mining approach to policy analysis in health insurance domain, International Journal of Medical Informatics, Vol. 62, Issue 2, 103-111.
3) Hadley, J., 2007, Insurance coverage, medical care use and short-term health changes following an unintentional injury or the onset of a chronic condition, JAMA, Vol. 297, No. 10.
4) Nagata, M., Matsumoto, K., Hashimoto, M., 2016, Prediction for disease risk and medical cost using time series healthcare data, Biomedical Engineering Systems and Technologies (BIOSTEC), Vol. 5, 517-522.
5) Delen, D., Fuller, C., McCann, C., Ray, D., 2009, Analysis of healthcare coverage: A data mining approach, Expert Systems with Applications, Vol. 36, Issue 2, 995-1003.
6) Sousa, S.I.V., Martins, F.G., Alvim-Ferraz, M.C.M., Pereira, M.C., 2007, Multiple linear regression and artificial neural networks based on principal components to predict ozone concentration, Environmental Modelling and Software, Vol. 22, 77-103.

Appendix: R Code: #import file project...