Week 7 - Predictive Analytics (Logistic Regression) PDF

Title	Week 7 - Predictive Analytics (Logistic Regression)
Author	Giulia Leone
Course	Analytical Methods for Marketing
Institution	City University London
Pages	11
File Size	386.4 KB
File Type	PDF

Summary

Predictive Analytics (Logistic Regression)...

Description

Marketing Analytics Week 7 - Predictive Analytics (Logistic Regression)

LECTURE 7 – PREDICTIVE ANALYTICS (LOGISTIC REGRESSION)

PART I - INTRODUCTION TO PREDICTIVE ANALYTICS & RECAP LINEAR REGRESSION

What is predictive analytics?

Predictive analytics is one of the four key types of marketing analytics. It determines likely outcomes given certain inputs. As such, predictive analytics looks at what could happen in the future by forecasting variables of interest. The general aim is to build an analytical model that can predict a specific target outcome of interest.

Types of predictive models

In general, two types of predictive models can be differentiated:

1. ESTIMATION MODELS
Estimation models approximate outcomes using relationships and equations. They are used to predict the value of a variable. Linear regression is a baseline estimation technique for modelling a continuous target variable, ie a target variable that can take an infinite number of values between any two values.

2. CLASSIFICATION MODELS
Classification models identify patterns to explain how variables contribute to specific outcomes. They are often used to predict categorical or binary target variables, ie target variables that can take only two or a finite number of values. Logistic regression is one of the foundational classification techniques and is used to predict the probability of a certain outcome of the target variable occurring.

The basic idea of linear regression

What is a linear regression?

A regression is a mathematical model that creates an arithmetic equation to explain the relationship between variables. A linear regression model analyses the linear relationship between one outcome variable, Y, and one or more predictor variables, X1, X2, … Xn.

- Y = Dependent variable (sometimes also referred to as outcome or target variable).
- X = Independent variables (sometimes also referred to as predictor variables).

Both the dependent variable and the independent variables are measured on interval or ratio scales (except in the special case of using dummy variables as independent variables).

Types of linear regressions

Depending on the number of independent variables used in the regression model to predict a dependent variable, two major regression types can be differentiated:

1. Bivariate (simple) linear regression: One X
Bivariate regression analysis is a statistical technique that analyses the linear relationship between one dependent variable (Y) and one independent variable (X).

2. Multiple regression: More than one X
Multiple regression analysis is a statistical technique that analyses the linear relationship between one dependent variable (Y) and multiple independent variables (X1, X2, … Xn).

The applications of linear regression

Linear regression helps decision makers to:
- understand the relationship between a set of variables via statistical analysis using historical data
- predict the value of an outcome variable (Y) based on a set of impacting factors (X1, X2, … Xn).

Linear regression in marketing

In marketing, Y (the dependent variable) is usually related to market performance (eg, satisfaction, sales), and X1, X2, … Xn (the independent variables) are often related to actionable variables that marketers can influence (eg, 4Ps) or consumer characteristics. Potential marketing applications of linear regression models include:
- estimating effects of marketing mix variables on sales/market shares
- quantifying the relationship between demographic or psychographic variables and attitude or loyalty towards a product or service
- determining variables that predict the sales of a product or service
- understanding the impact of different marketing mix elements on overall customer satisfaction.

Fundamental goal and basic function

The fundamental goal of a linear regression analysis is to fit a straight line through the points on a chart between the dependent and the independent variables. Once the equation for the straight line is known, we may estimate the value of the dependent variable for any valid value of the independent variables.

Formula

The general formula for a straight line in a simple bivariate regression is:

Y = a + b*X + ei

Where:
- Y = the dependent variable
- a = the intercept (point where the straight line intersects the Y-axis when X = 0)
- b = the slope (represents the regression coefficient and refers to the change in Y for every 1 unit change in X)
- X = the independent variable used to predict Y
- ei = the error for the prediction.

Note that in multiple regression analysis, multiple independent variables are entered into the regression equation, and for each variable a separate regression coefficient is calculated that describes its relationship with the dependent variable. The relationship between each independent variable and the dependent measure is still linear.
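To make the straight-line formula Y = a + b*X + ei concrete, the following minimal sketch fits a bivariate regression by ordinary least squares in plain Python. The data (advertising spend and sales) and the function name fit_simple_regression are hypothetical and are not part of the lecture case; the lecture itself uses SPSS for this step.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares estimates (a, b) for the line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept a: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical illustration data (not from the lecture).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_simple_regression(ad_spend, sales)

# ei: the prediction error for each observation.
residuals = [y - (a + b * x) for x, y in zip(ad_spend, sales)]

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

With an intercept in the model, the residuals of a least squares fit sum to zero, which is a quick sanity check on the estimates.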

Multiple linear regression in SPSS

In the following lessons, we are going to recap how to conduct a multiple linear regression in SPSS and how to interpret SPSS outputs using the following case of online banking customers.

The case

Imagine you are working as an analyst for a bank and there is substantial disagreement in the senior management group of the bank on whether:
- Option 1 - The bank should start charging fees for the use of the online banking channel.
- Option 2 - The bank should begin offering customer incentives such as rebates and lower service charges to encourage greater use of the online banking channel.

The debate really hinges on whether online customers are indeed more profitable customers. Particularly, the senior management group wants to know how a customer’s profit relates to the channel used (offline vs online) and his/her demographic characteristics (eg, age, tenure). Assume that IT provided you with a data sample of 30,000 customers, including age, tenure, income, customer profitability and whether a customer has opened an online account or not.

Conceptual model

To start with, it is helpful to draw a conceptual model and to determine the dependent and independent variables. What would this model and the regression function look like? What would be the dependent and independent variables based on the information given in the introduction of the case? And what type of linear regression would we need to conduct?

In this case we are interested in investigating how channel usage (offline vs online) and other demographic characteristics are able to predict customer profitability. Accordingly:
- Y: the dependent variable (the variable we want to predict) is customer profitability
- X: the independent variables (the variables we are going to use to predict customer profitability) are:
  1. Online Account (X1)
  2. Income Bracket (X2)
  3. District (X3)
  4. Age Group (X4)
  5. Relationship Length (X5)

Graphically, our conceptual model would look like this:

As we are dealing with more than one independent variable, we would therefore need to conduct a multiple regression analysis.

Final regression results

The final regression equation is:

Predicted Customer Profitability = -91.732 + 18.240*Online Account + 18.279*Age Group + 17.846*Income Bracket + 4.029*Relationship Length

Hence, all else being equal:
- when a customer uses the online (as compared to offline) channel, customer profitability is predicted to increase by £18.24
- when a customer’s age increases by one unit, customer profitability is predicted to increase by £18.28
- when a customer’s income increases by one unit, customer profitability is predicted to increase by £17.85
- when the relationship length increases by one unit, customer profitability is predicted to increase by £4.03.

Please note that different scales have been used to measure the independent variables. Hence, one unit denotes a different measurement unit for each variable. For example, the variable ‘relationship length’ has been measured in years, so an increase by one unit refers to an increase in the duration of the relationship with the bank by one year. In contrast, the variable ‘customer age’ has been measured in age brackets. Hence, moving up by one unit indicates moving up by one age group (eg, from age group two, ‘15-24 years’, to age group three, ‘25-34 years’). To make coefficients comparable across such different scales, researchers can calculate the standardised regression coefficient, which is called a beta coefficient. The regression equation can further be used to derive specific managerial insights.

Using the regression equation of the preferred model above, please answer the following questions:

1. What is the predicted profitability of an online customer who is in age group three and income bracket three and has been a customer of the bank for 10 years?
2. Based on the above model, how much more profit can the bank earn from an online customer as compared to an offline customer (assuming everything else is the same)?
3. Substitution effect: To achieve the same amount of profit that is derived from an online customer in age group three and income bracket three, by how many years should the bank try to extend its relationship with an offline customer in the same age and income group?

Answers:

1. Predicted Customer Profitability = -91.732 + 18.240*1 + 18.279*3 + 17.846*3 + 4.029*10 = £75.17
2. £18.24
3. Online customer: Profitability = -91.732 + 18.240*1 + 18.279*3 + 17.846*3 + 4.029*Relationship Length
   Offline customer: Profitability = -91.732 + 0 + 18.279*3 + 17.846*3 + 4.029*(Relationship Length + ?)
   Setting the two equal gives 18.240 = 4.029*?, so ? = 18.240/4.029 = 4.5 years
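The three answers can be checked with a short script. This is a minimal sketch that only re-applies the final regression equation; the coefficients are copied from the equation above, while the function and variable names (predict_profit, COEF and so on) are illustrative assumptions rather than anything produced by SPSS.

```python
# Coefficients copied from the final regression equation in the text.
INTERCEPT = -91.732
COEF = {
    "online_account": 18.240,
    "age_group": 18.279,
    "income_bracket": 17.846,
    "relationship_length": 4.029,
}


def predict_profit(online_account, age_group, income_bracket, relationship_length):
    """Predicted customer profitability (in pounds) from the fitted equation."""
    return (
        INTERCEPT
        + COEF["online_account"] * online_account
        + COEF["age_group"] * age_group
        + COEF["income_bracket"] * income_bracket
        + COEF["relationship_length"] * relationship_length
    )


# Question 1: online customer, age group 3, income bracket 3, 10 years.
q1 = predict_profit(1, 3, 3, 10)

# Question 2: online vs offline, everything else held equal.
q2 = predict_profit(1, 3, 3, 10) - predict_profit(0, 3, 3, 10)

# Question 3: extra years an offline customer needs to match the online effect.
q3 = COEF["online_account"] / COEF["relationship_length"]

print(f"Q1: {q1:.2f}  Q2: {q2:.2f}  Q3: {q3:.1f} years")
```

Because the model is linear, the answer to question 2 equals the Online Account coefficient regardless of the values chosen for the other variables.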

Summary of implications

- Online customers are more profitable than offline customers, given the same age, income, and relationship length.
- The place where customers live has no effect on profitability.
- Increasing age and tenure can increase profit among online and offline customers.
- The bank should begin offering customer incentives such as rebates and lower service charges to encourage greater use of the online channel.

PART II - INTRODUCTION TO LOGISTIC REGRESSION

Logistic regression: the basic idea

Many marketing problems and decisions deal with understanding the probability associated with certain events or behaviours. Often these events or behaviours are dichotomous or binary (0 or 1) in nature.

For example, marketing managers might be interested in:
- predicting whether a customer will make a purchase or not
- predicting whether a customer will use a product or not
- predicting whether an individual will respond to a promotion or not
- predicting whether a customer will ‘churn’ or not
- predicting whether an individual will make a certain click on a web page or not
- predicting whether a customer will get a loan or not.

In each case, all customers fall into one of two groups (Yes vs No) and the goal is to predict the probability of each customer’s group membership based on a set of independent variables. Typically such dependent variables can be predicted using logistic regression.

Dependent and independent variables

Logistic regression represents a classification modelling technique that predicts the probability of a certain class existing or an event occurring depending on one or several independent variables.

In logistic regression, like in linear regression, independent variables can either be:
o continuous (eg, age, income, or sales units) or
o categorical (eg, gender, religion, or region).

However, unlike linear regression, the dependent variable is either:
o categorical: binary or dichotomous (eg, 1 or 0) or
o dummy coded.

The two major types of logistic regression

Essentially, logistic regression represents a specialised form of a regression analysis that is specifically geared towards predicting a categorical (or dummy coded) dependent variable rather than a continuous metric one. Depending on the number of possible outcome categories that a dependent variable can have, two major types of logistic regression can be differentiated:
- Binary logistic regression: The dependent variable has only two outcome categories (eg, Yes vs No).
- Multinomial logistic regression: The dependent variable has three or more possible outcome categories (eg, disease A vs disease B vs disease C).

In the following lessons, we are going to focus on how to conduct a binary logistic regression in SPSS and how to make sense of corresponding regression results.


Characteristic of the logistic curve

In contrast to linear regression, the logistic curve is not a straight line, but an ‘S-shaped’ curve.

Recall that the value predicted for the dependent variable always represents a probability. Its value range is consequently restricted from 0 to 1, as the probability of an event never falls outside these limits. The turning point of the logistic curve is always at a probability level of 50 per cent (ie, p = 0.5, which is reached where the logit value equals 0), and the curve is symmetric to the left and right of the turning point.

In logistic regression, regression coefficients are estimated using the maximum likelihood estimation technique. This technique maximises a likelihood function that indicates how likely it is that the observed values of the dependent variable (1 or 0) can be predicted by the independent variable(s). It does so by choosing the coefficients so that the predicted probabilities are as high as possible for cases with y=1 and as low as possible for cases with y=0.
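The S-shaped curve and the idea behind maximum likelihood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four observations and the two candidate coefficient pairs are hypothetical, and the sketch merely compares two candidates instead of running the iterative search that SPSS performs.

```python
import math


def logistic(z):
    """Logistic (S-shaped) curve: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(a, b, xs, ys):
    """Log-likelihood of binary outcomes ys given the model p = logistic(a + b*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(a + b * x)
        # Rewards high p when y = 1 and low p when y = 0.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical data: low x values go with y = 0, high x values with y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Two hypothetical candidate coefficient pairs (a, b).
ll_good = log_likelihood(0.0, 1.0, xs, ys)   # slope matches the data pattern
ll_bad = log_likelihood(0.0, -1.0, xs, ys)   # slope points the wrong way

print(f"p at turning point: {logistic(0.0)}")
print(f"log-likelihood, b = +1: {ll_good:.3f}; b = -1: {ll_bad:.3f}")
```

Maximum likelihood estimation would prefer the first candidate because its log-likelihood is higher, ie it assigns higher probabilities to the outcomes that were actually observed.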

Sample size and function

Sample size

As a rule of thumb, each category of the dependent variable should contain at least 10 (better 25) observations per estimated parameter to allow meaningful interpretation. Assuming normal distribution, recommendations of the minimum overall sample size range between 50 and 400 observations depending on the number of independent variables used in the regression model.

Logistic regression function

Recall that the basic idea of logistic regression is to model the probability of an event occurring. But instead of modelling such a probability (p) directly, the logistic regression equation can be reformulated as a so-called log odds (logit) model. That is:

log(p/(1−p)) = a + b1*X1 + b2*X2 + ... + bn*Xn

Please note the following:

- p/(1−p) is called the ‘odds’.
- log(p/(1−p)) is termed the ‘logit value’, which is calculated by taking the natural logarithm of the odds. Accordingly, odds less than 1.0 have a negative logit value and odds greater than 1.0 have a positive logit value.
- The right-hand side of the equation looks identical to that of a linear regression where a refers to the intercept, b to the regression coefficient, and X is the independent variable used to predict the dependent variable.
- In SPSS (and most other software programmes), the input for running a logistic regression is still the original binary variable, but the programme is estimating the above model as the underlying model.

Mathematically, this log transformation solves the problem of restricted range: p can only range from 0 to 1, but log(p/(1−p)) can take any positive or negative value (see table below). As a result, the log transformation allows us to model the relationship in a way similar to linear regression.
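For readers who want to see how a logit value maps back to a probability, the logit model can be solved for p. The symbol z is introduced here only as shorthand for the right-hand side of the equation and does not appear in the original notes; log is taken to be the natural logarithm, as the worked logit value of 0.847 for odds of 2.333 suggests.

```latex
\begin{align*}
\ln\!\left(\frac{p}{1-p}\right) &= z, \qquad z = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \\
\frac{p}{1-p} &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{align*}
```

The last line is the S-shaped logistic curve: whatever value z takes, p stays strictly between 0 and 1, and z = 0 gives p = 0.5.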

Understanding ‘odds’

As you can see from the first column of the table above, probabilities always range between 0 and 1. To get a better understanding of the meaning of ‘odds’, let’s use an example:
- Let’s say that the probability that an event will occur (eg, that a customer will make a booking) is 70 per cent, thus p = 0.7 (see first column of the table).
- Then, the probability that the event will not occur (ie, that the customer will not make a booking) is 30 per cent (1.0 - 0.7 = 0.3) (see second column of the table).
- These probabilities indicate that the odds that the event will occur are 2.333 (0.7 ÷ 0.3 = 2.333). Stated differently, the customer is about 2.3 times as likely to make a booking as not to make one (see third column of the table).
- The logit value is thus 0.847 (calculated by taking the natural logarithm of 2.333) (see fourth column of the table).
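The four columns described above (p, 1 − p, odds, logit) can be recomputed with a short sketch. The probability levels other than 0.5 and 0.7 are chosen here purely for illustration and are not taken from the original table; the helper names odds and logit are likewise assumptions.

```python
import math


def odds(p):
    """Odds of an event: probability it occurs divided by probability it does not."""
    return p / (1.0 - p)


def logit(p):
    """Logit value: natural logarithm of the odds."""
    return math.log(odds(p))


# Illustrative probability levels; 0.7 is the booking example from the text.
print("p     1-p   odds    logit")
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"{p:.1f}   {1 - p:.1f}   {odds(p):.3f}   {logit(p):+.3f}")
```

For p = 0.7 this reproduces the odds of 2.333 and the logit value of 0.847 from the example, and probabilities below 0.5 yield negative logit values.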

What is the meaning of an odds value of 1 at a probability level of 0.5? As you can see from the table above, a probability of 0.5 results in odds of 1.0 (0.5 ÷ 0.5 = 1). This means that both outcomes (ie, that the customer will or will not make a booking) have an equal chance of occurring. Accordingly, odds less than 1.0 reflect probabilities less than 0.5 and odds greater than 1.0 reflect probabilities greater than 0.5.

Interpretation of regression coefficients

Recall that the logistic curve is not linear. Increasing an independent variable by one unit will, therefore, have a different impact on the probability that a specific event will occur (dependent variable) depending on the particular section of the logistic curve. This has a considerable impact on the interpretation of the regression coefficients: the size of a coefficient cannot be read as a fixed change in probability, so in terms of probability a coefficient can only be interpreted directly through its positive or negative sign. For example, if the regression coefficient is positive, it should be interpreted as follows: an increase in the independent variable by one unit will lead to an increased probability that the event (dependent variable) occurs (y=1).
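The point that a one-unit increase has a different impact depending on the section of the curve can be illustrated numerically. The intercept and coefficient below are hypothetical values chosen for illustration, not estimates from the lecture data.

```python
import math


def logistic(z):
    """Logistic curve: converts a logit value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model: logit = A + B*x (values assumed for illustration only).
A, B = 0.0, 1.0


def prob(x):
    """Predicted probability of the event (y = 1) at a given x."""
    return logistic(A + B * x)


def effect_of_one_unit(x):
    """Change in predicted probability when x increases from x to x + 1."""
    return prob(x + 1) - prob(x)


middle = effect_of_one_unit(0.0)  # starting at the turning point (p = 0.5)
tail = effect_of_one_unit(4.0)    # starting in the flat upper section

print(f"Effect near turning point: {middle:.3f}")
print(f"Effect in upper tail:      {tail:.3f}")
```

Both effects are positive because the coefficient is positive, but the same one-unit step moves the probability far more near the turning point than in the flat tail, which is why only the sign carries over directly to probabilities.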

Interpreting results in SPSS

Case processing summary

This table tells you about the cases that were included and excluded from the analysis. In our case, all 303 respondents (100 per cent) were included in the analysis.

Dependent variable encoding

This table indicates the original coding of the dependent variable and the coding that was used during analysis. In our case both codings are the same.

Block 0: Beginning block

The following three output tables indicate the so-called ‘null or empty model’. It represents the starting model of the logistic regression model which does not contain any variables in the model and only includes a constant. It is mainly used to calculate the pseudo-R2 (see below) and is hence not very informative and can be ignored.

Block 1: Method ‘Enter’

This output section is much more informative. It contains several tables. Of particular importance are the following tables:

Model summary: This table indicates how good the overall logistic regression model is. Several statistics of interest are reported:

-2 Log likelihood: This statistic is not very informative by itself but can be used to compare different models. As a rule of ...