Title | Logistic Regression - Lecture notes |
---|---|
Course | Business Analytics |
Institution | The University of Texas at Dallas |
Pages | 6 |
Motivation
● Using regression models for classification
  ○ E.g., use linear regression to predict someone’s income
● However, the dependent variable is sometimes "limited."
  ○ E.g., voting, morbidity/mortality
  ○ Participation data may not be continuous or normally distributed
● Logistic regression can be a good choice when the dependent variable is a dummy variable
  ○ Dummy variable: a 0/1 variable that represents a binary choice
  ○ E.g., 0 → did not vote, 1 → did vote
● Linear regression isn’t always a good choice because the Y value is constrained
  ○ If 0 means no and 1 means yes, what does a predicted point that falls in between mean?
The Logistic Regression Model
● p: Probability that the event Y occurs, i.e., p(Y = 1)
  ○ p = e^(α + βX) / (1 + e^(α + βX))
● If β is positive and X increases, then p increases
● If (α + βX) = 0, p = 0.50
● As (α + βX) increases, p → 1
● As (α + βX) decreases, p → 0
● Equivalent Expression: ln(p / (1 - p)) = α + βX
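The limiting behavior listed above can be checked numerically. The sketch below assumes the standard logistic form p = 1 / (1 + e^-(α + βX)); the coefficient values and the function names are hypothetical, chosen only so that α + βX = 0 at X = 4.

```python
import math

def logistic_probability(alpha, beta, x):
    """Return p(Y = 1) = 1 / (1 + e^-(alpha + beta*x))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def log_odds(p):
    """Return ln(p / (1 - p)), which equals alpha + beta*x under the model."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (not from the notes): alpha + beta*x = 0 at x = 4.
alpha, beta = -2.0, 0.5

for x in (0, 4, 8, 20):
    p = logistic_probability(alpha, beta, x)
    print(f"x={x:>2}  alpha+beta*x={alpha + beta * x:+.1f}  "
          f"p={p:.4f}  log-odds={log_odds(p):+.4f}")
```

With a positive β, p rises toward 1 as X grows and falls toward 0 as X shrinks, and the log-odds column reproduces α + βX, which is the equivalent expression.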
Interpreting Coefficients
● The odds, p / (1 - p), are intuitive
● Since the two equations above are the same, e^β is the effect of the independent variable on the odds, i.e., the odds ratio (read the equation below from right to left)
  ○ p / (1 - p) = e^(α + βX) = e^α · (e^β)^X
● If e^β is given, it describes the odds ratio
  ○ The odds change by this ratio when X increases by one unit
  ○ If X increases by one unit, the odds are multiplied by e^β
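A quick numerical check of the odds-ratio interpretation: under the logistic model, the odds at X + 1 divided by the odds at X should equal e^β regardless of the starting X. The coefficients and the helper name `odds` below are hypothetical.

```python
import math

def odds(alpha, beta, x):
    """Odds p / (1 - p) implied by the logistic model at a given x."""
    p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
alpha, beta = -1.0, 0.66

odds_ratio = math.exp(beta)

# Ratio of odds after a one-unit increase in x, at several starting points.
ratios = [odds(alpha, beta, x + 1) / odds(alpha, beta, x) for x in (0, 1, 5)]

print("e^beta =", round(odds_ratio, 6))
print("odds(x+1) / odds(x) =", [round(r, 6) for r in ratios])
```

The ratio is the same at every starting point, which is why a single number e^β summarizes the effect of X on the odds even though its effect on p itself varies with X.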
Maximum Likelihood Estimation
● MLE: a statistical method for estimating the coefficients of a model
● The Likelihood Function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, …, pn) that occur in the sample
  ○ The higher the value of L, the higher the probability of observing (p1, p2, …, pn) in the sample
  ○ MLE chooses the coefficients (α, β) that maximize L
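To make the idea concrete, here is a minimal sketch of maximum likelihood for a one-variable logistic model. It maximizes the log of L with Newton-Raphson steps, which is one common choice and not necessarily what any particular statistics package does. The data set and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Log of the likelihood L for a 0/1 outcome under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, steps=25):
    """Newton-Raphson search for the (alpha, beta) that maximize L."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0            # gradient of the log-likelihood
        haa = hab = hbb = 0.0    # entries of the negative Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            ga += y - p
            gb += (y - p) * x
            haa += w
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b

# Hypothetical sample: the outcome tends to be 1 for larger x, with some overlap.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 1, 1]

a_hat, b_hat = fit_logistic(xs, ys)
print("alpha_hat =", round(a_hat, 4), " beta_hat =", round(b_hat, 4))
print("log L at (0, 0):  ", round(log_likelihood(0.0, 0.0, xs, ys), 4))
print("log L at estimate:", round(log_likelihood(a_hat, b_hat, xs, ys), 4))
```

The fitted coefficients give a higher log-likelihood than any nearby values, which is the sense in which they are the "most likely" coefficients for this sample.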
● A one-unit increase in tenure multiplies the odds that a household stays by 1.02
● For households with pets, the odds of staying are 1.933 times as high
● For mobile homes, the odds of evacuating are 4.75 times as high
● A one-unit increase in education multiplies the odds of evacuating by 1.0514
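Assuming the four figures above are odds ratios (e^β) from the estimated model, each can be converted back to its coefficient β = ln(odds ratio) and to a percent change in the odds. The dictionary keys are labels made up for this sketch.

```python
import math

# Odds ratios quoted in the notes above (assumed to be e^beta values).
odds_ratios = {
    "tenure (stay)": 1.02,
    "pets (stay)": 1.933,
    "mobile home (evacuate)": 4.75,
    "education (evacuate)": 1.0514,
}

results = {}
for name, ratio in odds_ratios.items():
    beta = math.log(ratio)               # implied logit coefficient
    pct_change = (ratio - 1.0) * 100.0   # percent change in the odds per unit
    results[name] = (beta, pct_change)
    print(f"{name:<24} beta={beta:.4f}  odds change={pct_change:+.2f}%")
```

Ratios close to 1 (tenure, education) correspond to small coefficients, while the mobile-home ratio of 4.75 corresponds to a much larger one.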
Evaluating Model Performance
● Commonly used statistics to compare alternative models (or to evaluate the performance of a single model):
● Compare the model that includes only (α) versus the model that includes (α, β)
● Model Chi-Square (χ²)
  ○ The model log-likelihood ratio (LR) statistic: LR[i] = -2LL(α) - [-2LL(α, β)]
  ○ The LR statistic is distributed chi-square with i degrees of freedom
    ■ i is the number of independent variables
  ○ Use the “Model Chi-Square” statistic to determine if the model (α, β) is statistically significant
  ○ If the p-value is small, then the two models are different and we reject the null hypothesis (that all β = 0)
● Predicted outcome from the estimated probability p (not the p-value above):
  ○ If estimated p >= 0.5 then the event is expected to occur, and
  ○ If estimated p < 0.5 then the event is not expected to occur
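The Model Chi-Square calculation can be sketched as follows. The -2LL values are hypothetical, and `chi2_sf` is a small helper written for this example (valid for whole-number degrees of freedom) so that it runs with the standard library alone; in practice a statistics package would supply the p-value.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer df > x)."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))       # exact tail for df = 1
    # Step up two degrees of freedom at a time using the gamma recurrence.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def model_chi_square(neg2ll_null, neg2ll_full, n_predictors):
    """LR statistic comparing the (alpha) model with the (alpha, beta) model."""
    lr = neg2ll_null - neg2ll_full
    return lr, chi2_sf(lr, n_predictors)

# Hypothetical -2LL values for a model with 3 independent variables.
lr, p_value = model_chi_square(neg2ll_null=120.0, neg2ll_full=105.0, n_predictors=3)
print("LR =", lr, " p-value =", round(p_value, 5))
print("model is significant at 0.05:", p_value < 0.05)
```

A small p-value here means the independent variables jointly improve on the intercept-only model.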
The LR Test
● Among your variables, find the one with the highest p-value and remove it
● Re-estimate the model, remove the variable with the highest remaining p-value, and so on
● If the χ² value is < the critical value, then the removed coefficients are not statistically significant
  ○ The full model is not an improvement over the partial model
● If the χ² value is > the critical value → the models are statistically different
Structural Break
● There could be structural breaks in the data
● Pooling data imposes the restriction that an independent variable has the same effect on the dependent variable for different groups of data
  ○ May not be true
● Can conduct a likelihood ratio test:
  ○ LR[i+1] = -2LL(pooled model) - [-2LL(sample 1) + -2LL(sample 2)]
  ○ where samples 1 and 2 are pooled, and i is the number of independent variables
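The structural-break statistic is simple arithmetic on three -2LL values. The sketch below assumes the statistic is the pooled -2LL minus the sum of the two subsample -2LL values, with i + 1 degrees of freedom; the input numbers and the function name are hypothetical.

```python
def structural_break_lr(neg2ll_pooled, neg2ll_sample1, neg2ll_sample2, n_predictors):
    """Return (LR statistic, degrees of freedom) for the pooled-vs-separate test."""
    lr = neg2ll_pooled - (neg2ll_sample1 + neg2ll_sample2)
    df = n_predictors + 1  # i slope coefficients plus the intercept
    return lr, df

# Hypothetical -2LL values from a pooled fit and two subsample fits, i = 2.
lr, df = structural_break_lr(
    neg2ll_pooled=250.0,
    neg2ll_sample1=118.5,
    neg2ll_sample2=124.0,
    n_predictors=2,
)
print("LR =", lr, " df =", df)
```

The resulting statistic is then compared with the chi-square critical value for df degrees of freedom; a value above the critical value suggests the two groups should not be pooled.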
Constructing the LR Test - Example
● In the chi-square table, the p-value increases going to the left while the critical values decrease going to the left. Our value (0.43) lies far to the left of 11.345, so its p-value is greater than the significance level and we do not reject the null hypothesis
  ○ We use the third row (3 degrees of freedom) because that’s where the 11.345 is located...
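The table lookup can be cross-checked numerically. This sketch assumes that 0.43 is the computed LR statistic and that the third row of the table corresponds to 3 degrees of freedom; it uses the closed-form upper-tail probability for a chi-square variable with exactly 3 degrees of freedom.

```python
import math

def chi2_sf_3df(x):
    """P(chi-square with 3 df > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

critical_value = 11.345   # from the notes
statistic = 0.43          # from the notes, assumed to be the LR statistic

p_at_critical = chi2_sf_3df(critical_value)
p_at_statistic = chi2_sf_3df(statistic)
reject_null = statistic > critical_value

print("tail probability at 11.345:", round(p_at_critical, 4))
print("p-value of 0.43:", round(p_at_statistic, 4))
print("reject the null hypothesis:", reject_null)
```

The tail probability at 11.345 comes out at about 0.01, so that table entry is the 1% critical value, and the p-value for 0.43 is far above it, which agrees with the decision not to reject.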