Title | Chapter 5 Intro to linear regression |
---|---|
Course | Introduction to Statistics and Data Analysis |
Institution | University of Michigan |
SUMMARIZED NOTES ON CHAPTER 5...
Chapter 5: Intro to Linear Regression

Linear Regression:
● A statistical technique used to predict, or to evaluate whether there is a relationship between, two numerical variables.
● One variable is the explanatory variable and the other is the response variable. (x, y)
● Explanatory variable: used to predict the response variable. Denoted by X; specific values are denoted by x.
○ Explains the changes we see in the response variable.
● Response variable: the outcome; it responds to the explanatory variable. Denoted by Y; specific values are denoted by y.
● Remember: correlation does not mean causation!

Relationships between Numerical Variables: Scatterplots
● We use a scatterplot to display the association between two numerical variables.
○ Provides a case-by-case view of the data.
○ Each point represents a single case measured on the two variables.

When describing associations, use DUFS:
1. Direction: positive or negative
2. Unusual features: outliers, influential points, leverage
3. Form: linear or not
4. Strength: weak, moderate, strong (no correlation = no linear relationship)
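As a minimal sketch of checking direction and strength, the correlation coefficient r can be computed with NumPy (the data values here are made up for illustration):

```python
import numpy as np

# Hypothetical paired data: hours studied (explanatory, x) vs. exam score (response, y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([55.0, 60.0, 62.0, 70.0, 75.0, 80.0])

# Pearson correlation coefficient r:
# the sign gives the direction, |r| gives the strength.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear association
```

A scatterplot (e.g. with matplotlib) should still be examined, since r alone says nothing about form or unusual features.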
● Strong correlation suggests a linear relationship, but examine the scatterplot to confirm that the graph follows a straight line.

Modeling a Linear Relationship with a Straight Line
● Regression line: a straight line that describes how the response variable y changes as the explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
● Linear regression model:
µy = β0 + β1x
● µy = mean of Y when X = x
● β0 = population parameter denoting the y-intercept; the mean of Y when x = 0
● β1 = population parameter denoting the slope; the change in the mean of Y per unit change in x
● Sample data estimate:
ŷ = b0 + b1x
● Residuals: the difference between the actual value of y and the value predicted by the regression line.
ei = yi − ŷi
○ Each observation has a residual.
○ Observations above the line have positive residuals; observations below have negative residuals.
○ We want the residuals to be as small as possible.
○ Notation: a sample of n observations is denoted (x1, y1), (x2, y2), ..., (xn, yn); an individual point is denoted (xi, yi).
○ The average of the residuals, Σei / n, is zero.
○ The line fits best when the sum of squared residuals, Σei², is minimized.
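The residual properties above can be verified numerically. This sketch (with assumed data values) fits a least-squares line and checks that the residuals sum to zero:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the least-squares line; polyfit returns [slope, intercept] for degree 1.
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x
residuals = y - y_hat          # e_i = y_i - y_hat_i

# For a least-squares fit, the residuals sum (and average) to zero.
print(round(residuals.sum(), 10))
```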
Least-Squares Regression Line:
1. An estimate of the slope of the line is b1 = r · (sy / sx)
2. The line must pass through (x̄, ȳ), so the y-intercept should be b0 = ȳ − b1x̄
3. The equation is then: ŷ = b0 + b1x
● R can calculate the lsrl.
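These two formulas can be checked by hand against a library fit. A sketch with assumed data, computing b1 = r·(sy/sx) and b0 = ȳ − b1x̄ and comparing against NumPy's least-squares fit:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]
sx = x.std(ddof=1)             # sample standard deviation of x
sy = y.std(ddof=1)             # sample standard deviation of y

b1 = r * sy / sx               # slope:     b1 = r * (sy / sx)
b0 = y.mean() - b1 * x.mean()  # intercept: line passes through (x_bar, y_bar)

# Cross-check against NumPy's least-squares fit: they should agree exactly.
slope, intercept = np.polyfit(x, y, 1)
print(round(b1 - slope, 10), round(b0 - intercept, 10))
```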
Extrapolation & R²
● Extrapolation: using the regression line for prediction FAR outside the interval of x-values used to obtain the line. Such predictions are often NOT accurate. Don't extrapolate!
● R² (R-squared): used to describe the strength of a linear fit.
○ Describes the proportion of the variation in the response variable that is explained by the lsrl (least-squares regression line).
○ The closer R² is to 100%, the better the fit.
○ R² = (sY² − sRESID²) / sY²
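The variance formula for R² can be checked numerically. For simple linear regression it also equals r², the square of the correlation coefficient, which this sketch (with assumed data) verifies:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# R^2 from the variance formula: (sY^2 - sRESID^2) / sY^2
s2_y = y.var(ddof=1)
s2_resid = resid.var(ddof=1)
r_squared = (s2_y - s2_resid) / s2_y

# For simple linear regression, R^2 equals the squared correlation r^2.
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared - r**2, 10))
```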
Types of Outliers in Linear Regression
● High leverage: points that fall horizontally far from the rest of the data (in the x direction). They can influence the slope of the lsrl.
● Influential point: a high-leverage point that actually exerts influence on the slope of the line.
○ To check whether a point is influential, fit the lsrl with and without the point and compare the slopes and y-intercepts.
● Outliers that fall within the overall pattern make the lsrl appear stronger.
● Outliers should not be removed without a good reason, and the decision to remove or keep them should be clearly communicated.
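The with/without comparison described above can be sketched as follows (the data are made up: five points on a line plus one high-leverage point that breaks the pattern):

```python
import numpy as np

# Five points on the line y = 2x, plus one high-leverage point far in the
# x direction whose y value does NOT follow the pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 10.0])

slope_with, _ = np.polyfit(x, y, 1)          # fit including the suspect point
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit excluding it

# A large change in slope indicates the high-leverage point is influential.
print(round(slope_with, 3), round(slope_without, 3))
```

Here the slope without the suspect point is exactly 2, and including it pulls the slope far away, so the point is influential.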