Capstone Project - Final Business Report

Course	PostGRad
Institution	Indian Institute of Management Lucknow


Summary

Course projects for end term presentations- Business Analytics...

Description

Capstone Project- Final Report

Manish Verma

Index

Introduction
• Defining Problem Statement
• Business Problem
• Need for the Project
• Scope of Project
• Understanding business/social opportunity

Exploratory Data Analytics
• Understanding how data was collected in terms of time, frequency and methodology
• Visual inspection of data (rows, columns, descriptive details)

Insights From EDA

Data Cleaning
• Adding New Variables
• Removal of unwanted variables
• Univariate Analysis
• Bivariate Analysis
• Multivariate Analysis
• Missing Value Treatment
• Outlier Treatment

Model Building & Tuning
• Splitting the data into Test and Train
• Smote
• Logistic regression
• VIF Values
• Naïve Bayes
• K Nearest Neighbor (KNN)
• Model Comparisons
• Bagging
• Boosting

Model Validation
• Model Comparison

Conclusion and Recommendation

Impact of Risk Modelling

A. Introduction

Defining Problem Statement

Banks and financial institutions borrow funds from depositors, lenders and investors and lend them to borrowers. The difference between the cost of funds and the lending rate generally represents the net income of a bank, so lending is a critical business activity. Before offering a loan, the bank performs a comprehensive credit risk analysis of the borrower, which is largely based on three parameters:

1. Financial Risk
2. Business Risk
3. Market Risk

Analysing the past repayment behaviour of borrowers together with their financial, business and market situations can help banks and financial institutions develop models that predict the probability of default of their future borrowers.

Business Problem

Recently, banks have been experiencing a surge in loan defaults (i.e. customers not paying instalments on time) on individual loans issued on the basis of credit scores. To improve loan default prediction, banks need to scrutinise parameters other than the credit score to identify potential loan defaults.

Need for the Project

The project aims to build a predictive model for loan default based on the historical data provided for the existing customers of a bank. The model will help the bank identify potential defaulters and devise its lending policies. It will also help the bank make more accurate lending decisions with a low turnaround time.

Scope of the Project

To build and compare different predictive models on the data provided and recommend the best-suited model on the basis of model performance parameters.

Understanding business/social opportunity

Reducing the probability of loan default will help the bank bring down its net credit loss, which is a cost the bank provisions for as per regulations. This has a positive impact on the bottom line of the bank (Net Interest Income and PAT), providing more security to depositors and more value to shareholders. Stronger banks with sound financial health are key drivers of the economy and support the economic growth of a country.

B. Exploratory Data Analytics (EDA)

Understanding how data was collected in terms of time, frequency and methodology

The data provided is from a US bank and covers customers who borrowed from the bank between Dec 2011 and Jan 2015 (5 Years). The terms of all the loans have expired, so defaulters are clearly identified in the loan status as Default/Fully Paid. There are 2.26 lakh customers, and the observations are divided into 41 variables. The following types of variables are present in the data:

a) Demographic variables: information regarding the type of customer, such as member ID, employment years, house ownership etc.
b) Loan variables: loan amount financed, end use, rate of interest, history of repayment before the last loan was offered, repayment of the current loan, outstanding balance, last repayment date, loan amount applied for by the customer, loan amount funded etc.
c) Credit variables: variables that capture the outcome of the credit due diligence done at the time of loan approval, such as grade, number of past enquiries, verification status, end use of the loan amount, description etc.

Visual inspection of data (rows, columns, descriptive details)

• The data has 41 variables with 226786 observations. Of the variables provided, 16 are discrete and 25 are continuous. The categorical variables need to be transformed and converted to binary. There are 9298226 values in total, of which 484959 are missing. No column is entirely missing.

• Loan Status is the dependent variable; it captures whether the borrower has defaulted on loan repayment or has fully repaid the loan amount. As per the data provided, nearly 8.4% (19063) of the total population of borrowers have defaulted, which is a large number by banking norms. It also indicates that the data is imbalanced, as the default population is small compared to the total population.

• Most of the borrowers are individuals; only 6 customers have a joint loan application. Almost 60% of the loans were taken for debt consolidation or financial restructuring, which is consistent with the 8.4% delinquency rate.

• Customers have borrowed for a term of either 36 months or 60 months. 36-month terms account for approximately 80% of the loans, which indicates that customers have mostly availed medium-term loans.

• The variables term, verification, application type and loan status have only two levels and can therefore easily be converted to binary.

• Some date variables, such as issue date, have to be removed, as their time-series nature may affect model performance.

• Missing values are present in only 2 variables.
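The headline figures quoted in the bullets above can be cross-checked with a few lines of arithmetic. The sketch below is in Python purely for illustration (the report's own output appears to come from R); the variable names are hypothetical, and only the four input counts are taken from the report.

```python
# Counts quoted in the report
n_rows = 226_786      # observations
n_cols = 41           # variables
n_missing = 484_959   # missing values
n_default = 19_063    # borrowers flagged as Default

# Derived quantities
total_cells = n_rows * n_cols
missing_share = n_missing / total_cells
default_rate = n_default / n_rows
# Non-defaulters per defaulter, a common way to express class imbalance
imbalance_ratio = (n_rows - n_default) / n_default

print(f"total cells:   {total_cells}")
print(f"missing share: {missing_share:.2%}")
print(f"default rate:  {default_rate:.2%}")
print(f"imbalance:     {imbalance_ratio:.1f} : 1")
```

The product of rows and columns reproduces the 9298226 total values stated above, and the default count reproduces the 8.4% rate.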

Univariate Analysis Histograms of both continuous and categorical variables

Because of the presence of outliers, the variables are skewed, except DTI (debt-to-income ratio), which is approximately normally distributed.

Bivariate Analysis

Boxplots of all continuous variables against the dependent variable (Default) depict how effectively the independent variables can differentiate between default (1) and non-default (0).

• Outstanding principal is high in the case of defaulters.
• Because of low creditworthiness, defaulters have a higher interest rate than non-defaulters.
• Annual income is higher for non-defaulters than for defaulters.
• Recovery fee is also higher in the case of defaulters.

Multivariate Analysis

• The variables "loan_amnt", "funded_amnt" and "funded_amnt_inv" are highly correlated, which can affect model performance, and hence are to be removed from the data.
• "total_pymnt" and "total_pymnt_inv" are highly correlated with each other.
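A check of the kind described under Multivariate Analysis can be sketched as a pairwise Pearson correlation scan. This is a minimal illustration, not the report's actual procedure: the function names, the 0.9 threshold and the toy values are all assumptions; only the column names come from the report.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy values, made up for illustration only
toy = {
    "loan_amnt": [5000, 12000, 20000, 8000, 15000],
    "funded_amnt": [5000, 12000, 19500, 8000, 15000],
    "funded_amnt_inv": [4975, 11900, 19400, 7950, 14800],
    "dti": [12.5, 30.1, 8.4, 21.0, 17.3],
}
flagged = highly_correlated_pairs(toy)
for name_a, name_b, r in flagged:
    print(name_a, name_b, r)
```

On the toy data only the three loan-amount columns are flagged against each other. In practice one variable from each flagged group is usually kept rather than dropping the whole group, which is a design choice the analyst has to make.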

C. Insights From the EDA

• The main purpose of the loans is debt consolidation, followed by credit card.
• Delinquency is high among high-interest-rate borrowers, suggesting that the loans were priced on a risk basis. This is corroborated by the fact that lower grades carry higher rates.
• 8.4% of the borrowers have defaulted, which shows that the data is highly imbalanced. We must therefore use SMOTE to balance the data before creating predictive models.
• Many variables show high collinearity, meaning that they carry the same information, and must be removed from the data.
• Date variables must be removed from the data, as they can affect the correctness of the predictive models.
• Other character variables with multiple levels, such as number of years of employment, grade and home ownership, must be converted into factor variables before model creation. Variables which cannot be converted, such as description and address state, have to be removed from the data.
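The idea behind SMOTE, which the insights above call for, is to create synthetic minority rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration and not the implementation used in the report; the function names, `k`, the seed and the toy points are assumptions. The 33.3% target share is the post-SMOTE figure the report states in its Model Building section.

```python
import random


def synthetic_needed(n_minority, n_majority, target_share):
    """Synthetic minority rows needed so the minority makes up target_share."""
    needed = target_share * n_majority / (1 - target_share) - n_minority
    return max(0, round(needed))


def smote(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority point and a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority points (int_rate, dti), made up for illustration only
minority = [(18.5, 24.0), (21.0, 27.5), (19.2, 30.1), (23.4, 22.8)]
n_majority = 40
n_new = synthetic_needed(len(minority), n_majority, target_share=1 / 3)
new_rows = smote(minority, n_new)

share_after = (len(minority) + len(new_rows)) / (len(minority) + len(new_rows) + n_majority)
print(n_new, round(share_after, 3))
```

Synthetic rows should be generated from the training split only, which matches the report's statement that SMOTE is applied to the train data.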

D. Data Cleaning

Adding New Variables

The following categorical variables have been converted into binary variables, and the old variables are to be removed:

• Loan_status is converted to Default (this is the dependent variable)
• application_type is converted to Napplication_type
• term is converted to Nterm
• purpose is converted to Npurpose
• verification_status is converted to Nverification_status
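The conversions listed above amount to mapping a two-level categorical column to 0/1. A minimal sketch follows; the helper name `to_binary`, the choice of which level is coded as 1, and the exact level labels for term are assumptions, while the Default/Fully Paid levels are the ones named in the report.

```python
def to_binary(values, positive_level):
    """Map a two-level categorical column to 0/1 (1 where value == positive_level)."""
    levels = set(values)
    if len(levels) > 2:
        raise ValueError(f"expected at most two levels, got {sorted(levels)}")
    return [1 if v == positive_level else 0 for v in values]


# Toy rows, made up for illustration only
loan_status = ["Fully Paid", "Default", "Fully Paid", "Fully Paid"]
term = ["36 months", "60 months", "36 months", "36 months"]

default = to_binary(loan_status, positive_level="Default")
nterm = to_binary(term, positive_level="60 months")

print(default)
print(nterm)
```

The guard on the number of levels matters because a column with more than two levels cannot be mapped to a single 0/1 variable without first deciding how to group its levels.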

Removal of unwanted variables

Some variables, such as next_payment_dt, do not provide any information. Others provide similar information or have been converted to binary. These are to be removed before we do EDA. The following variables have been removed from the dataset:

• member_id
• mths_since_last_delinq
• recoveries
• collection_recovery_fee
• Loan_status
• application_type
• term
• purpose
• verification_status

Missing Value Treatment

Months since the borrower's last delinquency and revolving utilization rate have missing values.

• Months since the borrower's last delinquency is removed from the dataset and left out of the analysis.
• Missing values of revolving utilization rate are replaced with the mean value.
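Mean imputation of the kind described for revolving utilization rate can be sketched as follows. The helper name and the toy values are assumptions, and missing entries are represented here as `None`.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Toy revol_util values, made up for illustration only
revol_util = [54.0, None, 30.0, 66.0, None]
filled = impute_mean(revol_util)
print(filled)
```

To avoid leaking information from the test set, the mean would normally be computed on the training data and then reused when filling the test data.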

Outlier Treatment

The variables with outliers are Outstanding Principal, Outstanding Principal Investors, Total Payment, Total Payment Investor, Total Principal Received, Total Interest Received, Total Late Fee and Last Payment Amount. The quantile method is used to treat outliers in the dataset.
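One common reading of the "quantile method" is capping: values below a lower percentile or above an upper percentile are pulled back to those cut-offs. The report does not say which quantiles were used, so the 1st and 99th percentiles below are assumptions, as are the function name and the toy data.

```python
from statistics import quantiles


def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    # n=100 yields the 99 percentile cut points; index i-1 is the i-th percentile
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Toy payment amounts with one extreme value, made up for illustration only
values = list(range(1, 101)) + [10_000]
capped = cap_outliers(values)
print(min(capped), max(capped))
```

Capping keeps every row in the dataset, in contrast to deleting outlier rows, which would shrink an already small default class.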

E. Model Building & Tuning

Variables which were considered for model building

names(bank.num)
 [1] "member_id"               "loan_amnt"
 [3] "funded_amnt"             "funded_amnt_inv"
 [5] "int_rate"                "installment"
 [7] "annual_inc"              "dti"
 [9] "delinq_2yrs"             "inq_last_6mths"
[11] "mths_since_last_delinq"  "open_acc"
[13] "revol_bal"               "revol_util"
[15] "total_acc"               "out_prncp"
[17] "out_prncp_inv"           "total_pymnt"
[19] "total_pymnt_inv"         "total_rec_prncp"
[21] "total_rec_int"           "total_rec_late_fee"
[23] "recoveries"              "collection_recovery_fee"
[25] "last_pymnt_amnt"         "Default"
[27] "Napplication_type"       "Nterm"
[29] "Npurpose"

Splitting the data into Test and Train

Dimensions of train data: 158750 observations, 29 variables
Dimensions of test data: 68036 observations, 29 variables

Smote

Since the provided data is highly imbalanced, we must synthetically oversample the minority class to get better accuracy. Before SMOTE, 8.4% of the customers in the data have defaulted. SMOTE is applied to the train data, after which the minority class share rises to 33.3%. All the models are built on both the unbalanced and the balanced data to check the improvement in model performance.

Logistic regression

Logistic regression is a supervised learning method. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables. There is a significant change in log likelihood from the base model. Also, based on the p-value, we can reject the null hypothesis. Thus, the model is valid.

Call:
glm(formula = Default ~ ., family = binomial, data = train1)

Deviance Residuals:
  Min     1Q  Median     3Q    Max
-8.49   0.00    0.00   0.00   8.49

Coefficients: (3 not defined because of singularities)
              Estimate  Std. Error    z value  Pr(>|z|)
(Intercept) -2.890e+15   1.146e+06  -2.521e+09...