Title | Chapter 5 Intro to linear regression |
---|---|
Course | Introduction to Statistics and Data Analysis |
Institution | University of Michigan |
SUMMARIZED NOTES ON CHAPTER 5...
Chapter 5: Intro to Linear Regression

Linear Regression:
● A statistical technique used to predict, or to evaluate whether there is a relationship between, two numerical variables.
● One variable is the explanatory variable and the other is the response variable. (x, y)
● Explanatory variable: used to predict the response variable. Denoted by X; specific values are denoted by x.
○ Explains the changes we see in the response variable.
● Response variable: the outcome; it responds to the explanatory variable. Denoted by Y; specific values are denoted by y.
● Remember: correlation does not mean causation!

Relationships between Numerical Variables: Scatterplots
● We use a scatterplot to display the association between two numerical variables.
○ Provides a case-by-case view of the data.
○ Each point represents a single case measured on the two variables.

When describing associations, use DUFS:
1. Direction: positive or negative
2. Unusual features: outliers, influential points, leverage
3. Form: linear or not
4. Strength: weak, moderate, strong (no correlation = no linear relationship)
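As a minimal sketch of checking direction and strength, the correlation coefficient r can be computed with NumPy (the data values here are made up for illustration):

```python
import numpy as np

# Hypothetical paired data: hours studied (explanatory, x) vs. exam score (response, y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([55.0, 60.0, 62.0, 70.0, 75.0, 80.0])

# Pearson correlation coefficient r:
# the sign gives the direction, |r| gives the strength.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear association
```

A scatterplot (e.g. with matplotlib) should still be examined, since r alone says nothing about form or unusual features.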
● Strong correlation suggests a linear relationship, but examine the scatterplot to confirm that the graph follows a straight line.

Modeling a Linear Relationship with a Straight Line
● Regression line: a straight line that describes how the response variable y changes as the explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
● Linear regression model:
µy = β0 + β1x
● µy = mean of Y when X = x
● β0 = population parameter denoting the y-intercept; the mean of Y when x = 0
● β1 = population parameter denoting the slope; the change in the mean of Y per unit change in x
● Sample data estimate:
ŷ = b0 + b1x
● Residuals: the difference between the actual value of y and the value predicted by the regression line.
ei = yi − ŷi
○ Each observation has a residual.
○ Observations above the line have positive residuals; observations below have negative residuals.
○ We want the residuals to be as small as possible.
○ Notation: a sample of n observations is denoted (x1, y1), (x2, y2), ..., (xn, yn); an individual point is denoted (xi, yi).
○ The average of the residuals, Σei / n, is zero.
○ The line fits best when the sum of squared residuals, Σei², is minimized.
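The residual properties above can be verified numerically. This sketch (with assumed data values) fits a least-squares line and checks that the residuals sum to zero:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the least-squares line; polyfit returns [slope, intercept] for degree 1.
b1, b0 = np.polyfit(x, y, 1)

y_hat = b0 + b1 * x
residuals = y - y_hat          # e_i = y_i - y_hat_i

# For a least-squares fit, the residuals sum (and average) to zero.
print(round(residuals.sum(), 10))
```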
Least-Squares Regression Line:
1. An estimate of the slope of the line is b1 = r · (sy / sx)
2. The line must pass through (x̄, ȳ), so the y-intercept should be b0 = ȳ − b1x̄
3. The equation is then: ŷ = b0 + b1x
● R can calculate the lsrl.
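These two formulas can be checked by hand against a library fit. A sketch with assumed data, computing b1 = r·(sy/sx) and b0 = ȳ − b1x̄ and comparing against NumPy's least-squares fit:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]
sx = x.std(ddof=1)             # sample standard deviation of x
sy = y.std(ddof=1)             # sample standard deviation of y

b1 = r * sy / sx               # slope:     b1 = r * (sy / sx)
b0 = y.mean() - b1 * x.mean()  # intercept: line passes through (x_bar, y_bar)

# Cross-check against NumPy's least-squares fit: they should agree exactly.
slope, intercept = np.polyfit(x, y, 1)
print(round(b1 - slope, 10), round(b0 - intercept, 10))
```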
Extrapolation & R²
● Extrapolation: using the regression line for prediction FAR outside the interval of x-values used to obtain the line. Such predictions are often NOT accurate. Don't extrapolate!
● R² (R-squared): used to describe the strength of a linear fit.
○ Describes the proportion of the variation in the response variable that is explained by the lsrl (least-squares regression line).
○ The closer R² is to 100%, the better the fit.
○ R² = (sY² − sRESID²) / sY²
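The variance formula for R² can be checked numerically. For simple linear regression it also equals r², the square of the correlation coefficient, which this sketch (with assumed data) verifies:

```python
import numpy as np

# Hypothetical data, assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# R^2 from the variance formula: (sY^2 - sRESID^2) / sY^2
s2_y = y.var(ddof=1)
s2_resid = resid.var(ddof=1)
r_squared = (s2_y - s2_resid) / s2_y

# For simple linear regression, R^2 equals the squared correlation r^2.
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared - r**2, 10))
```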
Types of Outliers in Linear Regression
● High leverage: points that fall horizontally far from the rest of the data (in the x direction). They can influence the slope of the lsrl.
● Influential point: a high-leverage point that actually exerts influence on the slope of the line.
○ To check whether a point is influential, fit the lsrl with and without the point and compare the slopes and y-intercepts.
● Outliers that fall within the overall pattern make the lsrl appear stronger.
● Outliers should not be removed without a good reason, and the decision to remove or keep them should be clearly communicated.
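The with/without comparison described above can be sketched as follows (the data are made up: five points on a line plus one high-leverage point that breaks the pattern):

```python
import numpy as np

# Five points on the line y = 2x, plus one high-leverage point far in the
# x direction whose y value does NOT follow the pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 10.0])

slope_with, _ = np.polyfit(x, y, 1)          # fit including the suspect point
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # fit excluding it

# A large change in slope indicates the high-leverage point is influential.
print(round(slope_with, 3), round(slope_without, 3))
```

Here the slope without the suspect point is exactly 2, and including it pulls the slope far away, so the point is influential.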