Lecture 7 + 8 notes
Author: John Doe
Course: Introductory Statistics
Institution: The University of British Columbia




COMMERCE 291 – Lecture Notes 2021 – © Jonathan Berkowitz Not to be copied, used, or revised without explicit written permission from the copyright owner.

Summary of Lectures 7 and 8

More on Linear Regression

We introduced Linear Regression in the previous lecture notes. Here’s a quick review. Linear Regression aims to answer three questions.
• How do we find the “best-fitting” straight line through the scatterplot?
• Can we summarize the dependence of Y on X with a simple straight-line equation?
• Does the explanatory variable (X) help us explain the outcome variable (Y)?

The straight-line equation is called the least squares regression equation (or line):

ŷ = b0 + b1x,  where b1 = r (Sy / Sx)  and  b0 = ȳ − b1 x̄

The term “least-squares” comes from minimizing (i.e., least) the sum of squared vertical distances from the data points to the regression line. That is the criterion for “best-fitting.” Other criteria are theoretically possible, but they don’t address the questions we are trying to answer with regression.

The vertical distances are called residuals. They are defined as the difference between the observed y and the predicted y. In notation, we write ei = yi − ŷi.

The meanings of b0 and b1 should be familiar from elementary math:
b1 = change in Y for a unit change in X
b0 = value of Y when X is 0

To use the equation for prediction, substitute a value of X and compute Y.
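The slope, intercept, and residual formulas above can be sketched in a few lines of Python. This is a minimal illustration using made-up data (the x and y values are hypothetical, not from the notes):

```python
import statistics

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
s_x = statistics.stdev(x)   # Sx: sample SD of x
s_y = statistics.stdev(y)   # Sy: sample SD of y

# Sample correlation coefficient r
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x          # slope: b1 = r * (Sy / Sx)
b0 = y_bar - b1 * x_bar     # intercept: b0 = y-bar - b1 * x-bar

# Residuals: observed y minus predicted y (they always sum to 0 for least squares)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

A useful sanity check is that the residuals of a least-squares fit always sum to zero, which follows directly from the formula for b0.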

Interpretation of Regression

Now let’s explore the meaning and interpretation of regression. We can think of the main idea of regression in two simple ways.
1. A point that is 1 standard deviation above the mean in the X-variable is, on average, r standard deviations above the mean in the Y-variable.
2. For each value of X, the regression line goes through the average value of Y.


[Figure: Illustration of Regression. Scatterplot of Y versus X, both axes running from 0 to 8.]

At first glance, the position of the regression line is counterintuitive. The regression line has a much lower slope than the “line of symmetry” which corresponds to the principal axis of the ellipse describing the scatterplot. That is because the roles of X and Y are NOT interchangeable; it matters which variable is Y, since we are minimizing distances from the regression line in the Y direction. By the way, the line of symmetry is what you would get if you minimized the perpendicular distances from the data points to the line!

The concepts and calculations in regression come from Sir Francis Galton and his disciple Karl Pearson. Some of their pioneering studies were about quantifying hereditary influences and resemblances between family members. One study examined the heights of 1078 fathers and their sons at maturity. (Source: Freedman, Pisani, Purves)

A scatterplot (see below) of the 1078 pairs of values shows a positive linear relationship: taller fathers tend to have taller sons, and shorter fathers tend to have shorter sons. The summary statistics (in Imperial units, since that is what Galton and Pearson used) for the data set are as follows:

Mean height of fathers (X) ≈ 68 inches; SD ≈ 2.7 inches
Mean height of sons (Y) ≈ 69 inches; SD ≈ 2.7 inches; r ≈ 0.5

Using the least squares estimates we compute:

b1 = 0.5 × (2.7 / 2.7) = 0.5  and  b0 = 69 − (0.5)(68) = 35

Thus, the regression equation is: Son’s height = 35 + 0.5 × Father’s height
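The Galton/Pearson calculation can be reproduced directly from the summary statistics quoted above (means 68 and 69 inches, both SDs 2.7, r = 0.5). A short sketch:

```python
# Summary statistics from the father/son height study (Freedman, Pisani, Purves)
r, s_x, s_y = 0.5, 2.7, 2.7
x_bar, y_bar = 68, 69

b1 = r * s_y / s_x        # 0.5 * (2.7 / 2.7) = 0.5
b0 = y_bar - b1 * x_bar   # 69 - 0.5 * 68 = 35

def predict_son_height(father_height):
    """Son's height = 35 + 0.5 * Father's height."""
    return b0 + b1 * father_height

print(predict_son_height(72))  # prints 71.0
```

Note that the fitted line only needs the five summary numbers, not the 1078 individual pairs; that is a general feature of simple linear regression.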


[Figure: Heights of Fathers and Sons. Scatterplot of Son’s Height (inches) versus Father’s Height (inches), both axes running roughly from 58 to 80.]


This is a good time to discuss the regression effect. Consider a father who is 72 inches tall. A naïve prediction would be that, since the mean for all sons is one inch greater than the mean for all fathers, a 72-inch father could be expected to have a 73-inch son! But that would mean that each generation is one inch taller than the previous one! This would only be true if the correlation were perfect. But obviously, there are other influences, such as the mother’s height! Instead, the regression equation accounts for the weak correlation between fathers’ and sons’ heights. Using the regression equation, the son’s height would be predicted to be: 35 + 0.5 × 72 = 71 inches. So taller fathers do have taller sons, on average, but not necessarily as far above average. This is the regression effect.

It can be stated as follows: “In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the second test—and the top group will, on average, fall back. The regression fallacy consists of thinking that the regression effect must be due to something important, not just the spread around the line.” (Source: Freedman, Pisani and Purves)

There are countless examples and illustrations of the regression fallacy, and the error shows no signs of abating. Even experts who should know better fall for the fallacy. Examples of the regression fallacy:
• The “sophomore jinx” in professional sports
• Movie sequels
• Returns in mutual fund investing
• Reward and punishment in child-rearing
• Red-light cameras at intersections
• The Sports Illustrated cover jinx

Galton called the regression effect “regression to mediocrity.” Today we often refer to it as “regression to the mean.”
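The test-retest version of the regression effect is easy to simulate. In this hedged sketch (the model and numbers are illustrative, not from the notes), each person’s score is a stable “ability” component plus independent noise on each test; the top group on test 1 drifts back toward the overall mean on test 2:

```python
import random

random.seed(42)

# True ability plus independent noise on each of two tests
ability = [random.gauss(0, 1) for _ in range(10_000)]
test1 = [a + random.gauss(0, 1) for a in ability]
test2 = [a + random.gauss(0, 1) for a in ability]

# Take the top 10% on test 1 and compare their average scores on both tests
top = sorted(range(len(test1)), key=lambda i: test1[i])[-1000:]
mean_t1 = sum(test1[i] for i in top) / len(top)
mean_t2 = sum(test2[i] for i in top) / len(top)

# The top group is still above average on test 2, but less extreme:
# they "fall back" toward the mean even though nothing important happened.
print(mean_t1, mean_t2)
```

The fall-back here is driven entirely by the noise term, which is exactly the point of the regression fallacy: no jinx or reward effect is needed to explain it.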


How good is the fit?

Answer #1: Compute a one-number summary statistic. Square the correlation coefficient and you get a quantity denoted by r², and called, not surprisingly, r-squared or r-square. It can be interpreted as the fraction of variation in the y-values that is explained by the least squares regression of Y on X. Since r is a number between -1 and 1, r² is a number between 0 and 1 and can be expressed either as a proportion or a percentage.

Answer #2: Examine the residuals graphically. Compute all the residuals, ei = yi − ŷi, and plot them on the vertical axis versus X on the horizontal axis. The plot should show a horizontal band around 0 and should be symmetric around 0 if the regression provides an adequate fit. Other patterns indicate potential problems:
• If the data points are more tightly clustered on one side of the horizontal axis and more spread out on the other side, that suggests skewness in the Y-variable.
• If the residual plot shows curvature, then a linear trend is not a sufficient description of the pattern in the scatterplot.
• If the residual plot shows a funnel shape, then the accuracy of the predictions depends on the value of X.
• If the residual plot still shows a linear trend, then you computed the wrong line!

Skill and experience are needed so that you don’t overreact to patterns.

Residual plots are also useful for identifying outliers and influential observations:
Outlier: a data point that lies outside the overall pattern of the other data points. Points that are outliers in the y-direction have large residuals, but outliers in the x-direction may not have large residuals. Pay close attention to the y-direction outliers!
Influential observation: one which, if removed, would markedly change the position of the regression line.


Measuring the spread around the regression line

Recall that the standard deviation of a set of data values is a measure of the typical distance from any observation to the mean of the data values. Now we need a new standard deviation in this context: a measure of the typical distance from any y-value to the regression line. That is, we need the standard deviation of the residuals. Since the mean of the residuals is 0, this new standard deviation is easily expressed as:

se = √( Σei² / (n − 2) )

Note that the denominator is n − 2 rather than n − 1 as it was previously, because two preliminary calculations, namely of b1 and b0, need to be made before the residuals can be computed.

Standard deviations are crucial in statistics. So, a good place to end is by restating what this standard deviation means: se = SD of the residuals = typical distance from a data point to the regression line.

We will revisit regression later in the course, as part of inferential statistics and statistical modelling. This brings us to the end of Descriptive Statistics.
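The residual standard deviation formula above translates directly into code. A minimal sketch with illustrative data (not from the notes), showing the n − 2 denominator:

```python
import math

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals, then their SD with n - 2 in the denominator
# (two quantities, b1 and b0, were estimated before computing residuals)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
```

Since the residuals of a least-squares fit average to zero, no mean needs to be subtracted inside the sum of squares.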


Chapter 5. Randomness and Probability

We are ready to move on from our work on Descriptive Statistics to Inferential Statistics. But before we get there, we will address some of the foundational concepts: randomness, probability, random variables, and probability distributions. If these terms sound familiar, that’s good, because they were part of the prerequisite course leading to our course. The concepts were used previously with a different goal in mind. We will revisit them in the context of their use and importance in Statistics.

Random and Randomness

Let’s start with random and randomness. How do we think about and talk about these concepts? “That’s so random!” Randomness is difficult for humans because humans are a pattern-seeking species. We aren’t good with randomness, and we even try to make patterns out of random things.

A random phenomenon is something that produces random outcomes. Individual outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions. For example, we can’t predict the result of any particular coin toss, but we know that the proportion of heads gets closer and closer to 50% as the number of tosses increases.

Probability

There are multiple definitions and conceptions of this term. We have lots of synonyms, such as likelihood, chance, and odds.

One definition of probability is the proportion of times an outcome would occur in a very large number of repetitions. The probability of a fair coin coming up heads is 0.5 because the percentage of heads approaches 50% as the number of tosses increases. This is known as empirical probability. Think of it as long-run relative frequency. Another example is the probability of a defective product in assembly line production.

There are other definitions of probability. Model-based probability is theoretical. A coin toss has two possible outcomes, and they are equally likely, hence the chance of one of the two outcomes must be 0.5. This meaning of probability requires one to list all possible outcomes and assign likelihoods to each outcome. That can be done in well-defined situations like card games; in poker, what is the probability of a royal flush?
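The long-run relative frequency idea is easy to see in a simulation. This sketch tosses a fair coin with a pseudo-random generator (the seed and toss counts are arbitrary choices) and watches the proportion of heads settle near 0.5:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def proportion_heads(n_tosses):
    """Toss a fair coin n_tosses times and return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion wanders for small n but stabilizes near 0.5 for large n
for n in (10, 1_000, 100_000):
    print(n, proportion_heads(n))
```

Individual tosses remain unpredictable; only the long-run proportion is stable, which is exactly the empirical definition of probability.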


Personal probability is subjective. It treats probability as “degree of belief.” There is no model, and no repetitions that lead to a long-run frequency. It is an individual assessment from experience or intuition or prior knowledge. For example, what is the probability that a stock price will surpass one hundred dollars in the next year?

We will base our work on the first definition, long-run relative frequency. But no matter which definition we use, the interpretations are largely the same.

Here are the key terms, properties, and principles.

Outcomes and Events
Each attempt or trial or realization of a random phenomenon leads to an outcome. Combinations of outcomes are called events. For example, with a deck of cards:
• Drawing an ace is an outcome
• Drawing three aces in a row is an event

Law of Large Numbers
We said earlier that in a random phenomenon, we can’t know what the next outcome will be, but we can try to understand characteristics of the long-run behaviour. There is stability in long-term results.

Sample Space
The sample space is the complete list of all possible outcomes.
• We don’t need to write down this list…!
• But we do need to understand it, including what is in it and what is not in it.

Probability
As mentioned earlier, we will define the probability of an event as its long-run relative frequency.

Independence
The outcome of one event or trial doesn’t influence or change the outcome of another.

Disjoint or Mutually Exclusive Events
Two events that can’t both happen at the same time.

Classic Blunder: the “Law of Averages” ==> There is no such thing!
• I’ve lost quite a few poker hands in a row, so it is my turn to start winning! WRONG!
• The coin has landed on “tails” ten times, so the next one is more likely to be “heads”. WRONG!


You studied probability in COMM 290; we will not cover it here, except to remind you of some of the main rules of probability.

Six Main Rules of Probability (from COMM 290)
Note: For clarity I will abbreviate Probability as “Pr”, rather than just “P”. That’s because the single letter, p, appears in many different contexts in Statistics.

Probability Rule #1: Probabilities must be between zero and one. For an event A, 0 ≤ Pr(A) ≤ 1.

Probability Rule #2: The probability of the set of all possible outcomes, also called the sample space, S, must be 1: Pr(S) = 1.

Probability Rule #3: Either an event A will happen, or it won’t happen (the latter is referred to as the complement of A, written A^C): Pr(A) = 1 − Pr(A^C).

Probability Rule #4 (Multiplication Rule): If A and B are independent: Pr(A and B) = Pr(A) × Pr(B).

Probability Rule #5 (Addition Rule): If A and B are disjoint (mutually exclusive): Pr(A or B) = Pr(A) + Pr(B).

Probability Rule #6 (General Addition Rule): Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B).
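The rules above can be checked exactly for a small sample space. This sketch uses a hypothetical fair six-sided die (my choice of example, not from the notes), with A = “roll is even” and B = “roll is at most 2”; since A and B overlap, the general addition rule (Rule #6) is the one that applies:

```python
from fractions import Fraction

# Sample space S for one roll of a fair die: equally likely outcomes 1..6
outcomes = range(1, 7)

def pr(event):
    """Pr of an event (a predicate on outcomes) under equal likelihood."""
    return Fraction(sum(1 for o in outcomes if event(o)), 6)

A = lambda o: o % 2 == 0   # even roll: {2, 4, 6}
B = lambda o: o <= 2       # {1, 2}

pr_A, pr_B = pr(A), pr(B)                      # 1/2 and 1/3
pr_A_and_B = pr(lambda o: A(o) and B(o))       # {2}: 1/6
pr_A_or_B = pr(lambda o: A(o) or B(o))         # {1, 2, 4, 6}: 2/3

# Rule #3 (complement) and Rule #6 (general addition) hold exactly
assert pr(lambda o: not A(o)) == 1 - pr_A
assert pr_A_or_B == pr_A + pr_B - pr_A_and_B
```

Using exact fractions instead of floats makes the equalities in the rules hold exactly rather than approximately, which is handy for checking model-based probabilities.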

READ CHAPTER 5 FOR MORE REVIEW OF PROBABILITY, INCLUDING A SECTION CALLED FUN WITH PROBABILITY FOR SOME CLASSIC EXAMPLES. Closing Observation: Probability questions can be deceptive. There are perhaps more examples of counterintuition in probability than in any other field of study. So, be careful!

*** END OF LECTURE 8 ***


