Lecture 5 + 6 notes PDF

Title Lecture 5 + 6 notes
Author John Doe
Course Introductory to Statistics
Institution The University of British Columbia
Pages 12
File Size 354.8 KB
File Type PDF


COMMERCE 291 – Lecture Notes 2021 – © Jonathan Berkowitz Not to be copied, used, or revised without explicit written permission from the copyright owner.

Summary of Lectures 5 and 6

More about Quantitative Data

Review from previous notes

In the previous class we learned that there are quite a few numerical summaries to describe a distribution. They are so important they are reviewed here.

Measures of Centre or Location

The Centre of a set of data is best summarized by the mean or the median. (A third measure, called the mode, isn’t really used.)

• Mean: $\bar{x} = \frac{\sum x_i}{n}$ (where the data values are $x_1, x_2, \ldots, x_n$)
• Median: the middle value, after arranging the data in ascending order
• Mode: the highest point in the histogram

[See previous notes for extra material on three kinds of means.]

The mean is the “centre of gravity” of the distribution. The median is the middle of the distribution. If the distribution is highly skewed, or if there are outliers, the median is preferred.

For symmetric distributions, the mean equals the median. For asymmetric or skewed distributions, the mean and the median are not equal. For a right-skewed (long right-hand tail) distribution, the mean is greater than the median. For a left-skewed (long left-hand tail) distribution, the mean is less than the median.

The median is a more “robust” measure than the mean; that is, it is not highly affected by the presence of outliers.

The Excel function for the mean is AVERAGE, and for the median is MEDIAN.

Measures of Spread or Scale

The Spread of a set of data is best summarized by the standard deviation (s), or the variance ($s^2$), which is the square of the standard deviation, or the interquartile range (IQR).

Standard Deviation = $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

Variance = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
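As a rough illustration of the formulas above, the following Python sketch computes the mean, median, and sample standard deviation (with the n–1 denominator), and shows how one outlier moves the mean but not the median. The data values and the helper name `sample_sd` are made up for illustration; they are not part of the course material.

```python
import statistics

def sample_sd(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    return (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5

# Hypothetical data, chosen only for illustration.
data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # same data with the largest value replaced

# Mean and median agree for the symmetric data.
print(statistics.mean(data), statistics.median(data))
# With the outlier, only the mean moves.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
# The hand-written formula and the library function both divide by n - 1.
print(sample_sd(data), statistics.stdev(data))
```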


A quicker but weaker and very approximate measure is the range, which is the maximum minus the minimum.

The standard deviation (sometimes abbreviated as SD) is the typical distance from a data value to the mean.

Quartiles:
• Q1 = 1st or lower quartile: 25% of the data values lie below Q1 (aka the 25th percentile).
• Q3 = 3rd or upper quartile: 75% of the data values lie below Q3, and 25% lie above Q3 (aka the 75th percentile).
• Q2 = Median

The Interquartile Range (IQR) = Q3 – Q1. If the distribution is highly skewed, or if there are outliers, the interquartile range is preferred.

The Excel function you should use for standard deviation is STDEV.S, or an older version called STDEV. DO NOT USE: STDEV.P (it uses a denominator of n instead of n–1 in the calculation). For Range use Excel functions MAX and MIN and then subtract.

The Excel function you should use for quartiles is QUARTILE.INC, or an older version called QUARTILE. DO NOT USE: QUARTILE.EXC. Excel also has percentile functions: PERCENTILE.INC [or just PERCENTILE].

Neither the mean nor the standard deviation is resistant to outliers. They are also a poor choice of summary if the distribution is highly skewed. Base your decision on which numerical summaries to use on the symmetry or skewness of the distribution.

Symmetry ➔ Mean and SD
Skewness ➔ Median and IQR
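To make the n versus n–1 distinction concrete, here is a small Python sketch using the standard library. The data set is hypothetical, picked so the arithmetic is easy to check by hand; the comments pairing each call with an Excel function reflect the denominators described above.

```python
import statistics

# Hypothetical data: mean is 5, sum of squared deviations is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd_sample = statistics.stdev(data)       # divides by n - 1, like STDEV.S
sd_population = statistics.pstdev(data)  # divides by n, like STDEV.P
data_range = max(data) - min(data)       # like MAX minus MIN

print(sd_sample, sd_population, data_range)
```

The n–1 version is always the larger of the two, and the gap shrinks as the sample size grows.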


Five-number summary and Boxplot

A simple but effective way to summarize a distribution, that works regardless of whether there is symmetry or skewness, is the five-number summary and a graph of it, called the boxplot.

Five-Number Summary: Min, Q1, Median, Q3, Max

Boxplot (its full name is box-and-whisker plot): A graphical display of the median, quartiles, IQR, the min and max, and in an enhanced version, outliers too.

Version 1:

                  ___________
|----------------|_____|_____|----------------|
Min              Q1  Median  Q3              Max

Version 2: Compute the inner fences: Q1 – 1.5 IQR and Q3 + 1.5 IQR

                  ___________
*     |----------|_____|_____|----------|     *     *
Min              Q1  Median  Q3                    Max

In Version 2, each whisker extends to the last data value inside the inner fence; the asterisks represent the outlying values. There are also outer fences: Q1 – 3 IQR and Q3 + 3 IQR.

Boxplots and fences were developed by John Tukey. He suggested using the fences as indicators of outliers. Data values beyond the inner fences (but inside the outer fences) are called moderate outliers. Data values beyond the outer fences are called extreme outliers.

Boxplots can be presented horizontally or vertically. In Excel, they are vertical.

Boxplots are useful not only for quickly assessing the general shape of the distribution (whether symmetric or skewed) and the presence of outliers, but also for comparing the distributions of several samples. Just put multiple vertical boxplots on the same set of axes.
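The fence rules can be sketched in a few lines of Python. The function names `tukey_fences` and `classify` and the quartile values are hypothetical, introduced only to illustrate the 1.5 IQR and 3 IQR rules described above.

```python
def tukey_fences(q1, q3):
    """Return (inner, outer) fences, each as a (lower, upper) pair."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

def classify(value, q1, q3):
    """Label a value using the inner and outer fences."""
    inner, outer = tukey_fences(q1, q3)
    if value < outer[0] or value > outer[1]:
        return "extreme outlier"
    if value < inner[0] or value > inner[1]:
        return "moderate outlier"
    return "not an outlier"

# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10.
print(tukey_fences(10, 20))
print(classify(18, 10, 20), classify(40, 10, 20), classify(60, 10, 20))
```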


Illustration: Top 100 Money Winners on the 2019 PGA Tour

Statistic     Dollars      1000s of $
Mean          2,572,061    2,572
Median        2,135,628    2,136
SD            1,531,551    1,532
Range         8,608,454    8,608
Minimum       1,075,552    1,076
Maximum       9,684,006    9,684
Q1            1,519,078    1,519
Q3            3,170,408    3,170
IQR           1,651,330    1,651

Comments: This is a skewed distribution. What is the evidence? Answer: Mean ≠ Median (the mean is well above the median, which points to a right skew).

Inner fences:
Q1 – 1.5 IQR = –957,917
Q3 + 1.5 IQR = 5,647,403

Outer fences:
Q1 – 3.0 IQR = –3,434,912
Q3 + 3.0 IQR = 8,124,398

Split the data set by nationality into USA (n = 68) and Other (n = 32) and compare results using boxplots.

Figure labels: 1 = Brooks Koepka; 2 = Rory McIlroy; 3 = Matt Kuchar; 4 = Patrick Cantlay
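The fence values for the PGA Tour illustration can be reproduced from the reported quartiles. This is only a quick arithmetic check in Python, using the dollar figures from the summary table; the variable names are illustrative.

```python
# Summary values in dollars, taken from the PGA Tour table in these notes.
q1, q3 = 1_519_078, 3_170_408
minimum, maximum = 1_075_552, 9_684_006

iqr = q3 - q1
inner_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fences = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

print(iqr, inner_fences, outer_fences)
# Is the minimum below the lower inner fence? Is the maximum above the upper outer fence?
print(minimum < inner_fences[0], maximum > outer_fences[1])
```

With these numbers the maximum lies above the upper outer fence, so by Tukey's rule the top money winner counts as an extreme outlier, while nothing falls below the lower fences.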


As you can see from the previous example, boxplots are also an excellent visual method for comparing groups. Construct boxplots for each group and place them in vertical orientation on the same set of axes. Boxplots work for comparison if the groups are all measured on the same quantitative variable, such as salaries of various job classifications of employees. The groups are the job classifications; salary is the same variable for all groups.

Standardization and Z-scores

To compare groups measured on different units, use a method called standardization. To standardize a data value: “Subtract the mean and divide by the standard deviation.” The result is called a z-score. The reason for the name will be discussed later.

$z = \frac{x - \bar{x}}{s}$

A z-score is a unitless number that represents the number of standard deviations away from the mean that a data value lies. Z-scores can be used to compare variables measured in different units, as long as the distributions are reasonably symmetric.
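A minimal Python sketch of standardization follows. The height values and the function name `z_scores` are hypothetical; the point is only that the same measurements expressed in centimetres and in metres give the same z-scores, because the units cancel.

```python
import statistics

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical measurements of the same quantity in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

print(z_scores(heights_cm))
print(z_scores(heights_m))  # same z-scores up to rounding
```

By construction, a set of z-scores has mean 0 and standard deviation 1.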

Transforming Skewed Data

Skewness is not the same thing as outliers. Some distributions can be made to look reasonably symmetric just by the removal of a few outliers. That’s not really skewness. But other distributions are inherently skewed and cannot be made symmetric by removing outliers. In these cases, one method of understanding and summarizing the data is to re-express the data using a transformation. The most common one is to take the logarithm of the data values. Very long tails on a histogram do not look nearly as long on a log scale. This is especially handy for financial data such as salaries which are typically skewed to the right.
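The effect of a log transformation can be seen on a tiny example. The salary figures below are invented for illustration, and base-10 logs are an arbitrary choice (any base would show the same effect).

```python
import math
import statistics

# Hypothetical right-skewed salaries (one very large value).
salaries = [30_000, 40_000, 50_000, 60_000, 1_000_000]
log_salaries = [math.log10(s) for s in salaries]

# On the raw scale the mean is far above the median ...
print(statistics.mean(salaries), statistics.median(salaries))
# ... while on the log scale the two are much closer.
print(statistics.mean(log_salaries), statistics.median(log_salaries))
```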


Exercises on Graphical and Numerical Summaries
(Source: C. Chatfield, Problem-Solving: A Statistician’s Guide)

For each of the following data sets decide on appropriate graphical and numerical summaries.

a) Marks on a math exam (out of 100 and ordered by size) for 20 students.

30 35 37 40 40 49 51 54 54 55
57 58 60 60 62 62 65 67 74 89

b) The number of days of work missed in one year (ordered by size), for 20 workers.

0 0 0 0 0 0 0 1 1 1
2 2 3 3 4 5 5 5 8 45

c) The number of issues of the monthly magazine read by 20 people in a year:

0 1 11 0 0 0 2 12 0 0
12 1 0 0 0 0 12 0 11 0

d) The height (in metres) of 20 women who are being investigated for a certain medical condition:

1.52 1.60 1.57 1.52 1.60 1.75 1.73 1.63 1.55 1.63
1.65 1.55 1.65 1.60 1.68 2.50 1.52 1.65 1.60 1.65


Answers to Exercises:

a) A histogram of exam scores shows a reasonably symmetric bell-shaped distribution. If you see skewness, look again. If your opinion is swayed by the value of 89, cover it up and look again. Suitable summary statistics: mean = 55, median = 56 (average of 55 and 57), std. dev. = 14. Note that the mean and median are almost identical, which is another sign of symmetry. Make sure you haven’t reported too many decimal places; since the data are recorded as integers, one decimal place is more than enough!

b) A bar chart of days of work missed is severely skewed to the right. The value of 45 is an outlier, but probably not an error. It will cause a problem in the construction of the bar chart, and it highly influences the mean, which is 4.2 days. The median is 1.5 days and the mode is 0 days. The standard deviation is little help here since the distribution is so skewed. Overall, the summary statistics have little value here and the bar chart is probably the best way to summarize the data. I would also investigate the value of 45; it deserves a special comment in your summary.

c) A frequency distribution of number of issues read shows two modes, at zero and twelve. Summarizing a bimodal U-shape is even harder than summarizing a skewed distribution. Neither the mean nor the standard deviation is useful here; worse, both are misleading. No one reads half the issues in a year. Rather, the readers should be classified as “regular” or “not regular”. With this new categorical (binary) version of the data, you should report the percentage of regular readers, which is 5/20 or 25%.

d) Undoubtedly you found the egregious error in the data; the value of 2.50 is most certainly an error, not just an outlier. How should you deal with it? The safest thing to do is to omit it. Although you suspect it should be 1.50, you can’t just go around correcting data to what you think it should be. The worst thing to do is to leave it in uncorrected! After omitting it, the remaining data are reasonably symmetric, so draw a histogram and compute the mean and standard deviation.

An interesting piece of forensic statistics… Can you tell what country the data came from? If you examine the final digits, you’ll see that some numbers keep recurring and others never appear. A likely explanation is that the observations were made in inches and then converted to metres. Thus, a good hunch is that the study was done in the U.S., where the metric system is not used!

Moral: The calculation of summary statistics depends on the shape of the distributions and on a sensible treatment of errors and outliers.
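The numbers quoted in answers (a) to (c) can be checked with a short Python sketch using the exercise data. The cutoff of 11 issues used to define a “regular” reader is an assumption; it is simply the choice that reproduces the 5/20 stated in the answer.

```python
import statistics

marks = [30, 35, 37, 40, 40, 49, 51, 54, 54, 55,
         57, 58, 60, 60, 62, 62, 65, 67, 74, 89]   # exercise (a)
days_missed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
               2, 2, 3, 3, 4, 5, 5, 5, 8, 45]      # exercise (b)
issues_read = [0, 1, 11, 0, 0, 0, 2, 12, 0, 0,
               12, 1, 0, 0, 0, 0, 12, 0, 11, 0]    # exercise (c)

# (a) Roughly symmetric: mean, median, and sample SD.
print(statistics.mean(marks), statistics.median(marks), statistics.stdev(marks))

# (b) Heavily skewed: the outlier of 45 pulls the mean well above the median.
print(statistics.mean(days_missed), statistics.median(days_missed))

# (c) Bimodal: recode as regular (assumed here to mean 11 or more issues) or not.
regular = [n >= 11 for n in issues_read]
print(sum(regular) / len(regular))
```

The mean for (b) comes out as 4.25, which the answer reports as 4.2.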


Example with computations:

Norman Statson and his buddies at the Belltown Pub compared their vehicles’ gas mileage. Each participant drove a different car model. For a two-month period they recorded their cars’ mileage and gas consumption. Norman recorded data, in miles per gallon (MPG), for 24 Belltown cars. (The data are real, from a very old EPA study.)

Car  Mileage    Car  Mileage
 1     24       13     31
 2     23       14     32
 3     36       15     30
 4     27       16     29
 5     38       17     28
 6     13       18     25
 7     31       19     28
 8     17       20     21
 9     40       21     35
10     50       22     31
11     37       23     25
12     20       24     40

Ordered Data: 13,17,20,21,23,24,25,25,27,28,28,29,30,31,31,31,32,35,36,37,38,40,40,50

Try out the Excel functions (and a manual calculation), to replicate these results.

Statistic                  Manual     Excel Function                        Excel Result
Count (n)                  24         COUNT                                 24
Mean (x̄)                   29.625     AVERAGE                               29.625 (Note 1)
Variance (s²)              68.245     VAR.S or VAR                          68.245 (Note 2)
Std Dev (s)                8.261      STDEV.S or STDEV                      8.261 (Note 3)
Minimum (Min)              13         MIN                                   13
Maximum (Max)              50         MAX                                   50
Range                      37         MAX – MIN                             37
Median                     29.5       MEDIAN                                29.5 (Note 4)
Lower Quartile (Q1)        24.5(*)    QUARTILE.INC(…,1) or QUARTILE(…,1)    24.75
Upper Quartile (Q3)        35.5(**)   QUARTILE.INC(…,3) or QUARTILE(…,3)    35.25
Interquartile Range (IQR)  11.0       Q3 – Q1                               10.5 (Note 5)

(*) Q1 = between the 6th and 7th observations; take the average of the two = 24.5
(**) Q3 = between the 18th and 19th observations; take the average of the two = 35.5

Note 1: Report as 29.6. One decimal is sufficient since the data are integer values.
Note 2: Report as 68.2.
Note 3: Report as 8.3.
Note 4: Since there is an even number of observations, Median = (29+30)/2 = 29.5
Note 5: Both the manual and the Excel results are acceptable answers.

Using Excel: Lower inner fence = 24.75 – 1.5 (10.5) = 9.0 (Manual calculation: 8.0)
Using Excel: Upper inner fence = 35.25 + 1.5 (10.5) = 51.0 (Manual calculation: 52.0)
➔ No outliers in this data set.
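For readers without Excel, the same results can be replicated with Python's standard library. This is a sketch, not part of the course material; the `method="inclusive"` option of `statistics.quantiles` is used because it reproduces the QUARTILE.INC values in the table for this data set.

```python
import statistics

# Mileage for the 24 cars, in car order.
mpg = [24, 23, 36, 27, 38, 13, 31, 17, 40, 50, 37, 20,
       31, 32, 30, 29, 28, 25, 28, 21, 35, 31, 25, 40]

count = len(mpg)
mean = statistics.mean(mpg)
variance = statistics.variance(mpg)  # n - 1 denominator, like VAR.S
sd = statistics.stdev(mpg)           # n - 1 denominator, like STDEV.S
median = statistics.median(mpg)

# Quartiles by inclusive interpolation (matches the Excel column above).
q1, q2, q3 = statistics.quantiles(mpg, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in mpg if x < lower_fence or x > upper_fence]

print(count, mean, round(variance, 3), round(sd, 3), median)
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```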


Chapter 4. Correlation and Linear Regression

Scatterplots, Association, and Correlation

Studying relationships among variables is perhaps the major undertaking of statistical analysis. The generic term for a statistical relationship is association. That’s what we called a relationship between two categorical variables, based on a contingency table or a crosstab.

To study relationships between two or more quantitative variables, we need a few definitions. The first three terms have been used before, but they are worth reviewing.

Case: an individual subject on which measurements are taken and recorded.

Response variable: a variable that measures the outcome; also called “outcome variable” or “dependent variable” or “output variable.”

Explanatory variable: a variable that tries to explain or predict the observed outcome; also called “predictor variable” or “independent variable” or “input variable.”

The new term is:

Scatterplot: a plot of ordered pairs (xi, yi) where each pair is a case. As in standard mathematical graphs using Cartesian coordinates, the independent variable X goes on the x-axis and the dependent variable Y goes on the y-axis. We don’t draw scatterplots of categorical variables because the data are not “scattered” along the measurement scales.

For example, consider a graph of heights of fathers on the y-axis and the heights of sons on the x-axis. Each point on the scatterplot corresponds to one (father,son) pair.

A scatterplot indicates the association between the response and explanatory variables; that is, how much help one variable is in predicting the other. We investigate association by examining the direction, the form (linear or not), and the strength, all the while looking for outliers.

Direction can be positive or negative. Positive means that “as one variable increases, the other one increases”; negative means that “as one variable increases, the other one decreases.”

Form addresses whether there is a linear pattern or whether curvature is visible.

Outliers are points that are a long way from the main cluster of points, usually in the vertical direction.


That leaves strength. Correlation is the concept and measure of the strength of the clustering around a straight line. Correlation is only used for QUANTITATIVE data and LINEAR patterns.

Knowing the mean and SD of X and the mean and SD of Y is not sufficient to measure linear clustering. We need something that captures the connection between X and Y. As an analogy, remember that a dialogue (two people talking and listening to each other) is not the same as two monologues (two people talking but not listening).

Examples:
1. A scatterplot of salary (Y-variable) versus age (X-variable) shows an inverted U-shape. Although there IS a pattern, it is not linear. The correlation here is close to 0 because there is no straight-line clustering.
2. Correlation between seat belt wearing, Yes or No (Y-variable), and the Gender of a Driver (X-variable) does not make sense because both variables are categorical, not quantitative.

We call the measure of linear association the correlation coefficient and denote it by r. We can’t use the obvious choice “c” since it is used for other quantities. The formula is:

$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{S_x}\right)\left(\frac{y_i - \bar{y}}{S_y}\right)$

where $S_x$ and $S_y$ are the standard deviations of X and Y. The Excel function is CORREL.

Note: If you had to compute the correlation coefficient manually, here is a useful version of the formula. But don’t worry, you won’t have to do a manual calculation in this course or in modern life, since we have technology to do it.

$r = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{(n-1) S_x S_y}$

Properties of r, the correlation coefficient:
• -1 ≤ r ≤ +1. It can only be exactly -1 or +1 if all the points lie exactly on a straight line.
• The sign indicates positive or negative slope.
• r has no units. It is computed in “standard units” like z-scores and is not affected by a change in centre or spread.
• The roles of the X and Y variables are interchangeable. For example, the correlation of height and weight equals the correlation of weight and height.
• It is sensitive to outliers.
• The big one! Correlation is not the same as causation (i.e., cause and effect). If two variables are correlated it means that one can help predict the other, not that one necessarily causes the other.
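Both versions of the formula for r can be sketched in Python. The function names and the small data sets are hypothetical; the example checks that the two formulas agree, that an exact straight line gives r = 1, that a symmetric inverted U gives r = 0, and that swapping X and Y leaves r unchanged.

```python
import statistics

def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of products of z-scores."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

def correlation_shortcut(xs, ys):
    """Computational form: (sum(xy) - sum(x) * sum(y) / n) / ((n - 1) * sx * sy)."""
    n = len(xs)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / ((n - 1) * sx * sy)

# Hypothetical data: an exact straight line and a symmetric inverted U.
x = [-2, -1, 0, 1, 2]
line = [2 * v + 1 for v in x]
inverted_u = [-(v ** 2) for v in x]

print(correlation(x, line), correlation_shortcut(x, line))
print(correlation(x, inverted_u))  # strong pattern, but no linear association
print(correlation(line, x))        # roles of X and Y are interchangeable
```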


Some warnings about the use and abuse of correlation:
#1. Only use it for quantitative variables, not categorical variables.
#2. Only use it for linear trends, not trends with curvature; r only measures linear association; it doesn’t say anything about non-linear patterns, no matter how strong they are.
#3. Be careful of the potentially huge effect of outliers, especially in small data sets.
#4. Beware of the effect of outside (“lurking”) variables.
#5. Beware of extrapolation: extending beyond where you have data.
#6. Beware of correlations based on averages. Use all available data. Do not suppress variability or scatter artificially.
#7. Beware the perils of aggregation. This is the quantitative analogue to Simpson’s Paradox.
#8. One more time: Correlation is not the same thing as causation.

More about #4. Beware of the effect of outside (“lurking”) variables.
Examples:
a) The stork population and human population over time are highly correlated, because of the lurking variable of time. Over time, both variables naturally increase. It is not because storks bring babies!
b) In 1950, a study showed a correlation between the number of soft drink sales per week and the number of new cases of polio per week. The explanation was not that soft drinks cause polio, but that during the year, soft drink sales are higher in the summer and that is also when polio incidence is higher.

More about #5. Beware of extrapolation beyond where you have data.
Example: If it takes you 2 seconds to put together a 2-piece jigsaw puzzle, will it take you 3600 seconds (which is 1 hour) to put together a 3600-piece puzzle?!

More about #7. Beware the perils of aggregation.
Example: Height and weight are positively correlated for basketball players, and also for football players, but if you put basketball and football players together in the same scatterplot you will see a negative correlation between height and weight!

Smooth...

