
COMMERCE 291 – Lecture Notes 2021 – © Jonathan Berkowitz. Not to be copied, used, or revised without explicit written permission from the copyright owner.

Summary of Lectures 9 and 10

Chapter 6. Random Variables and Probability Distributions

Previously, we addressed two key terms and concepts: randomness and probability. Let's expand on those concepts with two more key terms and concepts. Both should be familiar to you; they were part of the previous course on Quantitative Decision-Making. Although they seem to be purely theoretical concepts, together they are the cornerstone of applied statistical inference and, in fact, everything we do in Statistics. The terms are random variable and probability distribution.

A random variable (abbreviated as r.v.) is a variable whose value is a numerical outcome of a random phenomenon. More completely, it is the set of possible outcomes and the probabilities associated with them. When the outcomes and their corresponding probabilities are put together, we have a probability distribution. Random variables and probability distributions can be discrete or continuous. A discrete r.v. is just like categorical data: it takes only discrete or integer values. A continuous r.v. is just like quantitative data: it takes continuous values, such as any value in the interval 0 to 1, or any positive (including fractional) value.

The mean of a random variable, also called the expected value, is the long-run average outcome. It is denoted by the Greek letter μ (which is the equivalent of the English letter "m" for "mean"). For a r.v. X, we write E(X) = μ (pronounced "mu"). Compare that with x̄, which is the mean of the actual data values.

The variance of a random variable is the long-run variance of the outcome and is denoted by σ², using the Greek letter σ (which is the equivalent of the English letter "s" for "standard deviation"). Thus, the standard deviation of a random variable is σ. For a r.v. X, we write Var(X) = σ² (pronounced "sigma-squared"), and SD(X) = √Var(X) = σ. Compare that with s², the variance of the data values, and s, the standard deviation of the data values.
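Here is a minimal Python sketch of these definitions, using a fair six-sided die as the random variable. (This is an illustration added for concreteness, not part of the original notes.)

```python
# A discrete probability distribution for a fair six-sided die:
# the possible outcomes paired with their probabilities.
die = {x: 1 / 6 for x in range(1, 7)}
assert abs(sum(die.values()) - 1.0) < 1e-12  # probabilities must sum to 1

# E(X) = sum of (outcome * probability): the long-run average outcome.
mu = sum(x * p for x, p in die.items())

# Var(X) = sum of ((outcome - mean)^2 * probability); SD is its square root.
var = sum((x - mu) ** 2 * p for x, p in die.items())
print(f"E(X) = {mu:.2f}, Var(X) = {var:.2f}, SD(X) = {var ** 0.5:.2f}")
# E(X) = 3.50, Var(X) = 2.92, SD(X) = 1.71
```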


Although we will not need to compute the mean and standard deviation of a random variable from its probability distribution, we will need to be able to figure out the mean and standard deviation of combinations of random variables. That's next.

Properties of Combinations of Random Variables (IMPORTANT!)

Here is a set of rules about what happens to means, variances, and standard deviations when a random variable is altered by a simple alteration called a linear transformation, and then when random variables are combined in various ways. In the following, a and b are constants, and X and Y are random variables.

1. Linear transformation: a + bX

E(a + bX) = a + bE(X)
Var(a + bX) = b²Var(X)
SD(a + bX) = |b| SD(X)

Notice that simply adding a constant has no effect on the variance or standard deviation.

2. Sum of two INDEPENDENT random variables: X + Y

E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y)
SD(X + Y) = √(Var(X) + Var(Y))

The mean of a sum is the sum of the means. The variances add only if X and Y are independent, and you cannot just add standard deviations (add the variances and then take the square root).

Note: If X and Y are not independent you cannot simply add the variances; see 5 below.

3. Difference of two INDEPENDENT random variables: X − Y

E(X − Y) = E(X) − E(Y)
Var(X − Y) = Var(X) + Var(Y)
SD(X − Y) = √(Var(X) + Var(Y))

The mean of a difference is the difference of the means. The variances are still added, not subtracted, but only if X and Y are independent; again, you cannot just add standard deviations (add the variances and then take the square root).

Note: As above, if X and Y are not independent you cannot simply add the variances.

Important: For independent random variables, the standard deviations of X + Y and X − Y are the same! In our applications, X and Y will always be independent!


4. Linear combination of two INDEPENDENT random variables: aX + bY

E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a²Var(X) + b²Var(Y)
SD(aX + bY) = √(a²Var(X) + b²Var(Y))

Again, this holds only if X and Y are independent, and you cannot just add standard deviations (add the variances and then take the square root).

5. Linear combination of two DEPENDENT random variables: aX + bY

Note: If X and Y are not independent, the mean of aX + bY is the same as above, but the variance and standard deviation are different. That is,

E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·SD(X)·SD(Y)·r, where r is the correlation coefficient
SD(aX + bY) = √(a²Var(X) + b²Var(Y) + 2ab·SD(X)·SD(Y)·r)

The formula for the variance is especially important in finance and portfolio balancing. We will not use it in our course because our random variables are almost always independent, due to our use of random samples. "Random" means "independent" here.

Postscript. Take careful notice of two key aspects:
1. Variances can only be added if the random variables are INDEPENDENT.
2. Never add standard deviations.

Example on Combining Random Variables

Warren has invested 20% of his funds in T-bills and 80% in a stock index fund. Let X = annual return on T-bills and Y = annual return on stocks. The portfolio rate of return is R = 0.2X + 0.8Y. Based on annual returns for 1950 to 2000: E(X) = 5.2%, E(Y) = 13.3%, SD(X) = 2.9%, SD(Y) = 17.0%. What are the expected value (i.e., mean) and standard deviation of R?

Solution:
E(R) = E(0.2X + 0.8Y) = 0.2E(X) + 0.8E(Y) = 0.2(5.2) + 0.8(13.3) = 11.68%

Var(R) = Var(0.2X + 0.8Y) = (0.2)²Var(X) + (0.8)²Var(Y), assuming X and Y are independent (probably not a very realistic assumption)
= (0.2)²(2.9)² + (0.8)²(17.0)² = 185.296

SD(R) = √185.296 = 13.61%

Note: The incorrect method is to take the weighted average of the standard deviations; i.e., (0.2)SD(X) + (0.8)SD(Y) = 14.18%, which is not equal to 13.61%!

*****
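As a quick check of these rules, here is a small Python sketch (added for illustration, not from the original notes) that reproduces the portfolio calculation under the independence assumption, and then shows the rule-5 adjustment for a purely hypothetical correlation r = 0.3.

```python
import math

# Portfolio weights and summary statistics from the example (in percent).
a, b = 0.2, 0.8          # weights on T-bills (X) and stocks (Y)
mean_x, mean_y = 5.2, 13.3
sd_x, sd_y = 2.9, 17.0

# Rule 4: mean and SD of R = aX + bY, assuming X and Y are independent.
mean_r = a * mean_x + b * mean_y
var_r = a**2 * sd_x**2 + b**2 * sd_y**2
print(f"E(R)  = {mean_r:.2f}%")            # 11.68%
print(f"SD(R) = {math.sqrt(var_r):.2f}%")  # 13.61%

# The incorrect method: a weighted average of the standard deviations.
print(f"Wrong 'SD' = {a * sd_x + b * sd_y:.2f}%")  # 14.18%

# Rule 5: if X and Y were correlated with (hypothetical) r = 0.3,
# the variance picks up the extra 2*a*b*SD(X)*SD(Y)*r term.
r = 0.3  # assumed value for illustration only
var_dep = var_r + 2 * a * b * sd_x * sd_y * r
print(f"SD(R) with r = 0.3: {math.sqrt(var_dep):.2f}%")
```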


Discrete Probability Distributions

There are various named discrete probability distributions: uniform, geometric, Poisson, and binomial (see sections 6.4 and 6.5 in the text). In our course we only need the binomial distribution. The Uniform, Geometric, and Poisson distributions will not be used in our course and are not examinable material.

The Binomial Distribution

We begin with the idea of a Bernoulli trial. It is a very simple experiment with the following properties:
• Two possible outcomes (called success and failure)
• The probability of a success is p (the probability of a failure is 1 − p, which we denote by q)
• The probability is the same from one trial to the next
• The trials are independent

The classic example is a coin toss: the outcomes are a head or a tail. If it is a fair coin, p = 0.5, and this probability is the same for all tosses. The result of one toss doesn't affect the result of another (i.e., independence).

Put a series of Bernoulli trials together and you get a Binomial random variable. Let X be the count of "successes" in n independent Bernoulli trials, where the probability of a success on each trial is p. Then we say that X is a Binomial random variable (i.e., X has a binomial probability distribution) with parameters n and p. The probability of getting k successes in n trials is given by:

Pr(X = k) = (n choose k) p^k q^(n−k)

where (n choose k) = n! / (k!(n−k)!)

We won't use this formula. It is very cumbersome to use for large n. It is only used when n is small (say, less than 20), and in our applications, with surveys, for example, n will be large. What you do need to know is that:

E(X) = μ = np
Var(X) = σ² = npq, where q = 1 − p
SD(X) = σ = √(npq)

(See the text for details, if you're interested in the derivation.)

Notation alert: Some books don't use q, so Var(X) = np(1 − p) and SD(X) = √(np(1 − p)).


Example: The standard example of a Binomial experiment is coin-tossing. Consider 100 tosses of a fair coin. Let X be the number of heads; then X is Binomial with n = 100 and p = 0.5. Therefore E(X) = 100(0.5) = 50 and SD(X) = √(100(0.5)(0.5)) = 5. That is, we expect about 50 heads, plus or minus about 5 heads.

The Binomial distribution is used to model binary data, so we will return to it in future lectures. Now it's time to discuss the world's most famous distribution, for continuous random variables.
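For concreteness, here is a small Python sketch (an illustration, not part of the original notes) that evaluates the binomial formula and its mean and standard deviation for the coin-tossing example.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k) = (n choose k) * p^k * q^(n-k), with q = 1 - p."""
    q = 1.0 - p
    return math.comb(n, k) * p**k * q**(n - k)

n, p = 100, 0.5  # 100 tosses of a fair coin
mean = n * p                      # E(X) = np
sd = math.sqrt(n * p * (1 - p))   # SD(X) = sqrt(npq)
print(f"E(X) = {mean}, SD(X) = {sd}")  # 50.0 and 5.0

# Probability of exactly 50 heads, and of 45 to 55 heads (about mu +/- sigma):
print(f"Pr(X = 50) = {binomial_pmf(50, n, p):.4f}")
print(f"Pr(45 <= X <= 55) = {sum(binomial_pmf(k, n, p) for k in range(45, 56)):.4f}")
```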


Chapter 7. The Normal and Other Continuous Probability Distributions

Previously we distinguished between discrete and continuous random variables and their probability distributions. For a discrete random variable, probability is relative frequency, and it is easy to compute by adding relative frequencies in a bar chart. For a continuous random variable, probability is equivalent to "area under the curve." The simplest example is the Uniform probability distribution for a continuous random variable. (And yes, the term Uniform is used here too; there is a discrete Uniform r.v. and a continuous Uniform r.v.)

To begin, consider this question: can we approximate the shape of a histogram of quantitative data with a smooth curve and express it in a compact mathematical form (i.e., an equation)? Yes, and the smooth curve is called a "density" curve. That brings us to the Normal Model or Normal Distribution. It is by far the most widely used continuous probability distribution, and it is extremely important!

The Normal Distribution (aka Normal Model)

Many phenomena that produce quantitative data have similarly shaped histograms, which are commonly known as "bell-shaped." The mathematical function that best describes this shape is called the normal curve or normal distribution. (Statisticians also call it the Gaussian distribution.)

[Figure: a small sketch of the normal curve]

Although the normal curve is very easy to visualize and quite easy to draw, it is difficult to handle mathematically. The equation of the normal curve is complex (but elegant and beautiful to mathematicians and statisticians):


f(x) = (1 / (σ√(2π))) · exp{ −(x − μ)² / (2σ²) }

where μ is the mean of the distribution and σ is the standard deviation of the distribution. We won't work directly with this equation, but it is impressive! By the way, the text uses an upper-case N and refers to the distribution as Normal.

The normal curve really does describe a bewildering range of phenomena. My favorite example is the popping behaviour of microwave popcorn. The intensity of popping follows a normal curve. For the first minute or two you hear nothing, and then the occasional pop. The popping becomes more vigorous until it reaches its peak intensity, and then begins to quiet down just as it began. Of course, you should remove it before it burns, but if you left it in, the second half of the process would be the mirror image of the first half. Listen carefully the next time you pop a bag of popcorn!

Since we can approximate a histogram with a smooth curve, relative frequency in a histogram corresponds to area under the smooth curve. The total relative frequency (or probability) is 1, so the total area under the curve is 100%.
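To demystify the equation a little, here is an optional Python sketch (added to these notes for illustration) that codes the density formula directly and checks it against the standard library's statistics.NormalDist.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 100, 15  # e.g., IQ scores, which appear in an example below
for x in (85, 100, 115, 130):
    # Our hand-coded formula agrees with the library's implementation.
    assert math.isclose(normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
    print(f"f({x}) = {normal_pdf(x, mu, sigma):.5f}")
```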

[Figure: the normal curve with areas marked at μ ± σ, μ ± 2σ, and μ ± 3σ. Remember that μ = mean, and σ = standard deviation.]


The graph displays an extremely useful guide to area under the normal curve, the 68-95-99.7 Rule:
• 68% of the area under the normal curve lies within μ ± σ
• 95% of the area under the normal curve lies within μ ± 2σ
• 99.7% of the area under the normal curve lies within μ ± 3σ

The data version of the 68-95-99.7 Rule is called the Empirical Rule. For a symmetric, bell-shaped (i.e., normal) distribution:
• 68% of the data values are within x̄ ± s
• 95% of the data values are within x̄ ± 2s
• 99.7% of the data values are within x̄ ± 3s

This helps explain why we say that the standard deviation represents the "typical" distance from the mean. About two-thirds of the data values are within one standard deviation of the mean.

The Empirical Rule leads to an extremely useful Rule of Thumb:
• s ≈ Range/6 (for bell-shaped distributions and large n)
• s ≈ Range/4 (for bell-shaped distributions and small n, approximately 20)

Why does this work? The Range is the distance from the Min to the Max. The interval x̄ ± 3s covers almost the same distance and has a width of 6s. So the Range is very approximately equal to 6s; divide by 6 to get the rule of thumb. These are rough approximations only. Remember that "≈" means "approximately equal to." Don't rely on this for exact calculations of the standard deviation.

***

Just as a histogram represents the data values of a quantitative variable, the normal curve represents the possible values of a random variable. As usual, we will use X or Y to represent random variables. If a random variable X has a normal distribution with mean μ and standard deviation σ, we write X is N(μ, σ). To compute areas under the normal curve we use a linear transformation to standardize to a standard normal curve. If X is N(μ, σ), then standardize as follows:

Z = (X − μ) / σ


Then Z is called the "standard normal"; it has mean 0 and standard deviation 1, and we write N(0, 1). So "standardization" means "subtract off the mean and divide by the standard deviation." This should look familiar: it is how we computed z-scores a few lectures ago. And now you know why we called them z-scores; they come from the standard normal distribution.

We can apply the 68-95-99.7 Rule (aka the Empirical Rule) to Z-values:
• 68% of the Z-values lie between −1 and 1
• 95% of the Z-values lie between −2 and 2
• 99.7% of the Z-values lie between −3 and 3

Remember that Z has no units. It is a "pure" number that measures how many standard deviations away from the mean a value lies.

Note: We introduced the ideas of "standardization" and a "z-score" in Chapter 3, but there we used the actual data, not the hypothetical distribution. In that situation, z = (x − x̄) / s. The interpretation is the same.
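As an informal check on the Empirical Rule and the Range/6 rule of thumb above, here is a small Python simulation (an illustrative sketch, not part of the notes); the sample size and seed are arbitrary, and the exact counts will vary from run to run.

```python
import random
import statistics

random.seed(1)
data = [random.gauss(100, 15) for _ in range(500)]  # simulated bell-shaped data

xbar = statistics.mean(data)
s = statistics.stdev(data)

# Empirical Rule: the fractions of values within 1, 2, and 3 SDs of the mean
# should come out near 68%, 95%, and 99.7%.
for k in (1, 2, 3):
    frac = sum(abs(x - xbar) <= k * s for x in data) / len(data)
    print(f"within {k} SD of the mean: {frac:.1%}")

# Rule of thumb: for a few hundred bell-shaped observations,
# s and Range/6 should be roughly comparable.
print(f"s = {s:.1f}, Range/6 = {(max(data) - min(data)) / 6:.1f}")

# A z-score: how many standard deviations a value lies from the mean.
print(f"z-score of 130: {(130 - xbar) / s:.2f}")
```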

Question: I can hear you scratching your head already; why have we introduced new notation, namely μ for the mean and σ for the standard deviation? Isn't that what x̄ and s represented? For now, the answer is that we are making the leap from real observed empirical data to a hypothetical collection of possible values. You can think of the normal curve as a "stylized" description of your real data. And since it is "stylized," it needs its own notation for mean and standard deviation.

Here is an illustration of how we will use the normal curve. Suppose you are interested in the behaviour of a quantitative variable such as the height of a population. A histogram of your collected data shows a symmetric bell shape, so you think it appropriate to summarize the shape with a normal curve. Because the properties of the normal curve are known, you will now be able to compute the chance that a height from your population exceeds a certain limit or falls within a particular interval. First, however, we need a bit of proficiency at normal curve calculations. Here are some examples.


Examples of Normal Calculations

Example 1. Suppose X is a measurement variable having a normal distribution with mean 980 and standard deviation 40. Find the probability that X is between 960 and 1060.

Pr(960 < X < 1060) = Pr( (960 − 980)/40 < (X − μ)/σ < (1060 − 980)/40 ) = Pr(−0.5 < Z < 2)

Draw a sketch of the area from −0.5 to 2. You know from the Empirical Rule that the area between 0 and 2 is half the area from −2 to 2, or half of 95%, or 47.5%. The area from −1 to 0 is half the area from −1 to 1, or half of 68%, or 34%. Therefore, the area between −0.5 and 0 is a bit more than half of 34%; let's say about 20%. Adding the two areas gives 47.5% + 20%, or about 67% or 68%.

Now that you have a good estimate based on the Empirical Rule, use Table Z in Appendix C to compute the area exactly. You can find Z = 2 on the right-hand page; the area to the left of 2 is .9772. You can find Z = −0.5 on the left-hand page; the area to the left of −0.5 is .3085. The area we want (check your sketch) is the area to the left of 2 minus the area to the left of −0.5, which is .9772 − .3085 = .6687, or 67% (very close to our estimate).

Hence the probability that X is between 960 and 1060 is .67, or 67%.
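Here is the same calculation in Python, a sketch using the standard library's statistics.NormalDist (an addition for checking table look-ups, not a tool used in the course).

```python
from statistics import NormalDist

X = NormalDist(mu=980, sigma=40)

# Pr(960 < X < 1060) = area to the left of 1060 minus area to the left of 960.
prob = X.cdf(1060) - X.cdf(960)
print(f"Pr(960 < X < 1060) = {prob:.4f}")  # about 0.6687

# Equivalently, standardize and use the standard normal N(0, 1).
Z = NormalDist()  # mean 0, SD 1
print(f"{Z.cdf(2.0) - Z.cdf(-0.5):.4f}")   # same answer
```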

Useful Hints (Strongly Suggested!):
1. Always draw a little picture so you get the correct area defined.
2. Look at the picture and make a rough estimate using the Empirical Rule.


Example 2. Consider the Wechsler IQ test; scores for the general population are normally distributed with mean 100 and standard deviation 15.

a) What is the chance that a randomly chosen person has an IQ exceeding 120?

Pr(X > 120) = Pr( (X − μ)/σ > (120 − 100)/15 ) = Pr(Z > 1.33) = 1 − 0.9082 = .0918, or 9.2%

(Note: The Empirical Rule tells us that the area to the right of 1.33 is less than 16%, i.e., less than half of the area outside −1 to 1, or half of 32% (100% − 68%).)

To do this in Excel, use the function NORM.DIST. In the "function wizard" (fx), click on "Statistical" under "Function Category"; then click on NORM.DIST and follow the instructions. NORM.DIST(120,100,15,1) = 0.9088. NORM.DIST gives the area to the left of the value of x you give, so if you want the right-hand tail you need to subtract the result from 1.

Note: The difference between the Excel computation and the table look-up value is due to rounding of 1.333... to 1.33.

b) To belong to Mensa, the high-IQ society, you need an IQ in the top 2% of the population. What IQ score on the Wechsler test is needed?

Pr(X > ??) = .02. Find the value that goes in place of "??". That is, find the z that has an area of .02 to the right; in the tables, look up an area of .9800. You will find it between z = 2.05 and z = 2.06, so take z = 2.055. Now "unstandardize" as follows:

Z = (X − μ)/σ, so X = μ + Zσ

Hence, X = 100 + (2.055 × 15) = 131. Therefore, you need an IQ of 131 or higher to belong to Mensa.

To do this in Excel, use the function NORM.INV and follow the instructions. To be in the top 2% means to be at the 98th percentile. You need to enter 0.98 for "probability", because you want a z-value with 0.98 area to the left. NORM.INV(0.98,100,15) = 131.

Note: You will also find the functions NORM.S.DIST and NORM.S.INV. There is an extra S in the function name; that stands for Standard. Experiment with these yourselves to see how they differ from the previous two functions.
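If you prefer Python to Excel, statistics.NormalDist offers close analogues of NORM.DIST and NORM.INV; this sketch (an addition to the notes) redoes parts a) and b).

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Part a): like 1 - NORM.DIST(120, 100, 15, TRUE) in Excel.
print(f"Pr(X > 120) = {1 - iq.cdf(120):.4f}")   # about 0.0912

# Part b): like NORM.INV(0.98, 100, 15), the 98th percentile.
print(f"Mensa cutoff: {iq.inv_cdf(0.98):.1f}")  # about 130.8, i.e., 131
```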


Example 3. In course A you get 85%, the class average is 80%, and the SD is 5%. In course B you get 74%, but the class average is 66% and the SD is 4%. Which course did you do better in relative to your classmates? Assume that grades are normally distributed.

To compare values from two normal distributions, convert to Z-scores.

For course A, Z = (85 − 80)/5 = 1
For course B, Z = (74 − 66)/4 = 2

You were 2 SDs above the class average in course B (i.e., at the 97.5th percentile) but only 1 SD above the class average in course A (i.e., at the 84th percentile). So, relative to your classmates, you did better in course B even though your absolute grade was lower.

Example 4. Consider two investments whose returns are normally distributed. Investment A has a mean return of 10% with a SD of 10%. Investment B has a mean return of 15% with a SD of 20%. Which has a higher probability of losing money (i.e., which has a higher probability of being below 0)?

For investment A, Pr(X < 0) = Pr(Z < (0 − 10)/10) = Pr(Z < −1) = 0.1587
For investment B, Pr(X < 0) = Pr(Z < (0 − 15)/20) = Pr(Z < −0.75) = 0.2266
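The same comparison in Python, as a quick check of the table values (an illustrative sketch, not part of the notes):

```python
from statistics import NormalDist

# Pr(return < 0) for each investment, using its own normal model.
for name, mu, sigma in (("A", 10, 10), ("B", 15, 20)):
    p_loss = NormalDist(mu, sigma).cdf(0)
    print(f"Investment {name}: Pr(X < 0) = {p_loss:.4f}")
# Investment A: 0.1587; Investment B: 0.2266
```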

Investment B has a higher probability of losing money. (This is not surprising because it has a greater potential reward but also a greater risk.)

Example 5. The U.S. Army reports that head circumference among soldiers is approximately normal with mean 22.8 in...

