1.3 Descriptive Statistics (Numerical Measures) PDF

Title 1.3 Descriptive Statistics (Numerical Measures)
Author Ho Hyde
Course Statistic
Institution The Chinese University of Hong Kong (香港中文大學)


LECTURE 1.3: SUMMARIZING AND DESCRIBING NUMERICAL DATA

MEASURE OF CENTRAL TENDENCY

Arithmetic Mean:

Population mean:

$$\mu = \frac{\sum x}{N},$$

where $\sum x$ means "the summation of all numbers" and $N$ is the population size. Similarly,

Sample mean:

$$\bar{x} = \frac{\sum x}{n},$$

where $n$ is the sample size.

Example 1: Calculate the mean of the following data set: { 1, 2, 2, 4, 6, 9, 11 } Mean = (1 + 2 + 2 + 4 + 6 + 9 + 11)/7 = 35/7 = 5 

Median: Formally, the median is the value such that 50% of the observations are smaller and 50% of the observations are larger. Intuitively, we can think of it more conveniently as the middle value when all numbers are arranged in increasing or decreasing order. For example, the median of the data set, { 1, 2, 2, 4, 6, 9, 11 }, is 4 because it appears in the middle of the distribution. Note: If the number of scores is even, the median, by definition, can be any number in between the two middle numbers. In this case, we usually take, by convention, the average of two middle numbers as the median. For example, the median of the following data set, { 1, 2, 2, 4, 6, 9 }, is 3 because 3 is the average of the middle two numbers 2 and 4.



Mode: the score that occurs most frequently. For example, the mode of the data set, { 1, 2, 2, 4, 6, 9 }, is 2. A data set can have two modes (bimodal), more than two modes (multimodal), or no mode when no score is repeated. For example:

- the data set, { 1, 2, 2, 2, 3, 3, 3, 5, 7, 9 }, has two modes: 2 and 3,
- the data set, { 1, 2, 2, 3, 3, 5, 5, 7, 9 }, has three modes: 2, 3, and 5, and
- the data set, { 1, 2, 3, 4, 5, 6, 7 }, has no mode because each number in the set appears only once.
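The examples above can be checked with a short Python sketch. It uses the standard `statistics` module for the mean and median; the helper `modes` is a hypothetical name introduced here, written to follow the notes' convention that a data set with no repeated score has no mode.

```python
from collections import Counter
from statistics import mean, median

data_odd = [1, 2, 2, 4, 6, 9, 11]   # seven scores
data_even = [1, 2, 2, 4, 6, 9]      # six scores

def modes(values):
    """Return all most-frequent values, or [] when no value repeats.

    Hypothetical helper: mirrors the notes' 'no mode' convention.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]

print(mean(data_odd))     # mean of Example 1
print(median(data_odd))   # middle value of an odd-sized data set
print(median(data_even))  # average of the two middle values
print(modes(data_even))
print(modes([1, 2, 2, 2, 3, 3, 3, 5, 7, 9]))
print(modes([1, 2, 2, 3, 3, 5, 5, 7, 9]))
print(modes([1, 2, 3, 4, 5, 6, 7]))
```

Note that `statistics.median` applies the same convention as the notes for an even number of scores: it averages the two middle values.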

When deciding which measure of central tendency is best, we usually need to consider the following factors:

- Does it consider all values? Among the above measures, only the mean considers all values.
- Is it easily affected by extreme values? Among the above measures, the mode and the median are not affected by extreme values. The mean can be affected by extreme values.


Relationship between mean and median: 

When data are symmetrically distributed, mean = median 



When data are skewed to the left (negatively skewed), median > mean.

When data are skewed to the right (positively skewed), median < mean.
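As a rough illustration of how an extreme value affects the mean but not the median, the sketch below takes the data set from Example 1 and replaces its largest value with a much larger one. The modified data set is an assumption made up for this illustration; it is not from the lecture.

```python
from statistics import mean, median

original = [1, 2, 2, 4, 6, 9, 11]
# Hypothetical variant: the largest value 11 is replaced by an extreme value 110,
# which stretches the right tail (positive skew).
with_outlier = [1, 2, 2, 4, 6, 9, 110]

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(label, "mean =", round(mean(data), 2), "median =", median(data))
```

The median stays at 4 in both cases, while the mean is pulled toward the extreme value, so the median is smaller than the mean in the right-skewed variant.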

MEASURE OF VARIATION (DISPERSION) 

Range: Range = Max - Min

The range measures the total spread in the data set. Although the range is a simple measure of total variation in the data, its distinct weakness is that it does not take into account how the data are actually distributed between the smallest and largest values. In fact, the range can be easily affected by the extreme values. 

Percentiles and Quartiles

Percentile: The n-th percentile of a data set is a value such that at least n% of data values are less than or equal to it and at least (100 - n)% are greater than or equal to it.

Quartiles: There are three quartiles, denoted as Q1, Q2, and Q3.

- Q1, the 25th percentile, is the value in the data set such that at least 25% of the data points are less than or equal to Q1 and at least 75% are greater than or equal to Q1.
- Q2, the 50th percentile, is the value in the data set such that at least 50% of the data points are less than or equal to Q2 and at least 50% are greater than or equal to Q2. Q2 is simply the median.
- Q3, the 75th percentile, is the value in the data set such that at least 75% of the data points are less than or equal to Q3 and at least 25% are greater than or equal to Q3.



Interquartile Range: The interquartile range, defined as Q3 – Q1, considers the spread in the middle 50% of the data.
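The range, quartiles, and interquartile range can be sketched in Python as follows. The function `percentile` is a hypothetical helper that applies the definition above literally: it returns the smallest data value that satisfies both "at least" conditions. Statistical software often interpolates between data values instead, so its quartiles may differ slightly from these. For an even number of scores this helper returns the lower of the two middle values at p = 50 rather than their average.

```python
def percentile(data, p):
    """Smallest data value v such that at least p% of the values are <= v
    and at least (100 - p)% of the values are >= v (0 <= p <= 100).

    Hypothetical helper that follows the definition in the notes.
    """
    xs = sorted(data)
    n = len(xs)
    for v in xs:
        at_or_below = sum(1 for x in xs if x <= v)
        at_or_above = sum(1 for x in xs if x >= v)
        if at_or_below * 100 >= p * n and at_or_above * 100 >= (100 - p) * n:
            return v
    raise ValueError("p must be between 0 and 100")

data = [1, 2, 2, 4, 6, 9, 11]

data_range = max(data) - min(data)
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1

print("Range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("Interquartile range:", iqr)
```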



Standard Deviation and Variance

The population variance, denoted as $\sigma^2$, is defined as

$$\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N},$$

where
N: population size
x_i: ith observation
μ: population mean

Intuitively, variance measures "the average squared distance between the observations and the mean".

Example: Calculate the variance of the population X = { 2, 4, 4, 6, 5, 6, 8 }.

x       μ     (x - μ)   (x - μ)²
2       5     -3        9.00
4       5     -1        1.00
4       5     -1        1.00
6       5      1        1.00
5       5      0        0.00
6       5      1        1.00
8       5      3        9.00
Mean    5      0        3.14

$$\mu = 35/7 = 5$$

$$\sigma^2 = \frac{(2-5)^2 + (4-5)^2 + (4-5)^2 + (6-5)^2 + (5-5)^2 + (6-5)^2 + (8-5)^2}{7} = \frac{22}{7} \approx 3.14$$

[Figure omitted: graphical illustration of the example, with an axis running from 2 to 8.]

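Alternatively, $\sigma^2$ can be computed as

$$\sigma^2 = \frac{1}{N}\sum x^2 - \mu^2 = \frac{1}{N}\sum x^2 - \left(\frac{1}{N}\sum x\right)^2$$

In the above example,

x       x²
2       4
4       16
4       16
6       36
5       25
6       36
8       64
Sum 35  197

Since $N = 7$, $\sum x^2 = 197$, and $\mu = \sum x / N = 35/7 = 5$,

$$\sigma^2 = \frac{1}{N}\sum x^2 - \mu^2 = \frac{1}{7}(197) - 5^2 \approx 3.14$$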
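Both ways of computing the population variance can be checked numerically. The sketch below recomputes the example by the definition and by the shortcut formula, then compares the result with `statistics.pvariance` from the Python standard library; the variable names are chosen here for illustration only.

```python
from statistics import pvariance

X = [2, 4, 4, 6, 5, 6, 8]
N = len(X)
mu = sum(X) / N                      # population mean, 35/7

# Definition: average squared distance from the mean
var_definition = sum((x - mu) ** 2 for x in X) / N

# Shortcut: mean of the squares minus the square of the mean
var_shortcut = sum(x * x for x in X) / N - mu ** 2

print(round(var_definition, 2))
print(round(var_shortcut, 2))
print(round(pvariance(X), 2))
```

The two formulas agree up to floating-point rounding, which is why the comparison uses rounded values.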

The population standard deviation, $\sigma$, is simply the square root of $\sigma^2$, i.e.,

$$\sigma = \sqrt{\sigma^2}$$

Hence, in the above example, $\sigma = \sqrt{3.14} \approx 1.77$.

The sample variance, denoted as $s^2$, is defined as

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1},$$

where
x_i: ith observation
x̄: sample mean
n: sample size

Hence, if we have a sample {1, 2, 4, 6, 7}, the sample variance is

$$s^2 = \frac{(1-4)^2 + (2-4)^2 + (4-4)^2 + (6-4)^2 + (7-4)^2}{5 - 1} = 6.5,$$

and the sample standard deviation is

$$s = \sqrt{6.5} \approx 2.55$$

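Similarly, $s^2$ can be computed by using the following alternative formula:

$$s^2 = \frac{\sum x^2}{n-1} - \frac{n\,\bar{x}^2}{n-1} = \frac{\sum x^2}{n-1} - \frac{n}{n-1}\left(\frac{\sum x}{n}\right)^2 = \frac{\sum x^2}{n-1} - \frac{\left(\sum x\right)^2}{(n-1)(n)}$$

In the above example,

$$s^2 = \frac{106}{4} - \frac{20^2}{4 \times 5} = 26.5 - 20 = 6.5,$$

and

$$s = \sqrt{6.5} \approx 2.55$$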
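The sample variance example can be reproduced in a few lines of Python. This is a minimal sketch: it evaluates the definition and the alternative formula for the sample {1, 2, 4, 6, 7} and compares them with `statistics.variance` and `statistics.stdev`, which both divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

sample = [1, 2, 4, 6, 7]
n = len(sample)
x_bar = sum(sample) / n                      # sample mean, 20/5

# Definition: squared deviations divided by n - 1
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Alternative formula: sum of squares and square of the sum
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
s2_alternative = sum_x2 / (n - 1) - sum_x ** 2 / ((n - 1) * n)

s = sqrt(s2_definition)

print(s2_definition, s2_alternative, variance(sample))
print(round(s, 2), round(stdev(sample), 2))
```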

Why is the sample variance defined as $s^2 = \sum (x - \bar{x})^2/(n - 1)$, and not as $s^2 = \sum (x - \bar{x})^2/n$? With the first formula, the sample variance is an "unbiased estimator" of the population variance: its expected value (or simply its mean) is equal to the population variance, i.e., $E(s^2) = \sigma^2$. With the second formula, $E(s^2) \neq \sigma^2$, as we will now demonstrate.

Consider a population X = {1, 2, 3, 4, 5} from which we take samples of size 2 with replacement. The table below lists all 25 possible samples, their means, and their sample variances (there are two sample variances: one calculated with the first formula, $\sum (x - \bar{x})^2/(n-1)$, the other with the second formula, $\sum (x - \bar{x})^2/n$). As the table shows, when $s^2$ is calculated using $\sum (x - \bar{x})^2/(n-1)$, the mean of all these $s^2$ values is 2, which equals the population variance $\sigma^2 = 2$. However, when $s^2$ is calculated using $\sum (x - \bar{x})^2/n$, the mean of all these $s^2$ values is 1, which is not equal to the population variance $\sigma^2 = 2$.

Sometimes statisticians use the second formula to calculate a sample variance. In that case, the sample variance so calculated is called a "biased estimator" of the population variance, because its expected value is not equal to the population variance.

x1      x2      x̄       Σ(x - x̄)²/(n-1)   Σ(x - x̄)²/n
1       1       1.00    0.00               0.00
1       2       1.50    0.50               0.25
1       3       2.00    2.00               1.00
1       4       2.50    4.50               2.25
1       5       3.00    8.00               4.00
2       1       1.50    0.50               0.25
2       2       2.00    0.00               0.00
2       3       2.50    0.50               0.25
2       4       3.00    2.00               1.00
2       5       3.50    4.50               2.25
3       1       2.00    2.00               1.00
3       2       2.50    0.50               0.25
3       3       3.00    0.00               0.00
3       4       3.50    0.50               0.25
3       5       4.00    2.00               1.00
4       1       2.50    4.50               2.25
4       2       3.00    2.00               1.00
4       3       3.50    0.50               0.25
4       4       4.00    0.00               0.00
4       5       4.50    0.50               0.25
5       1       3.00    8.00               4.00
5       2       3.50    4.50               2.25
5       3       4.00    2.00               1.00
5       4       4.50    0.50               0.25
5       5       5.00    0.00               0.00
Mean =          3.00    2.00               1.00
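The demonstration above can be reproduced by enumerating every sample in code. The sketch below is one way to do it, assuming ordered samples of size 2 drawn with replacement from {1, 2, 3, 4, 5}; the helper name `sum_sq_dev` is introduced here for illustration.

```python
from itertools import product
from statistics import mean, pvariance

population = [1, 2, 3, 4, 5]
n = 2

# All 25 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

divide_by_n_minus_1 = [sum_sq_dev(s) / (n - 1) for s in samples]
divide_by_n = [sum_sq_dev(s) / n for s in samples]

print("Population variance:", pvariance(population))
print("Mean of s^2 with n - 1:", mean(divide_by_n_minus_1))
print("Mean of s^2 with n:", mean(divide_by_n))
```

Averaging over all equally likely samples is the same as taking the expected value, so the first average matching the population variance is exactly what "unbiased" means here.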

What does the standard deviation tell us?

Chebyshev's Theorem

For any set of data values, at least $100\left(1 - \frac{1}{k^2}\right)\%$ of the values must lie within ±k standard deviations from the mean, where k > 1. Hence, this essentially says that

- 75% or more of the data values must lie within ±2 standard deviations from the mean, and
- 89% or more of the data values must lie within ±3 standard deviations from the mean.

If the data are unimodal and symmetrical, then Chebyshev's Theorem can be made even more precise: "at least $100\left[1 - \frac{4}{9}\left(\frac{1}{k^2}\right)\right]\%$ of the data values must lie within ±k standard deviations from the mean". Hence, this essentially says that

- 55.6% or more of the data values must lie within ±1 standard deviation from the mean,
- 88.9% or more of the data values must lie within ±2 standard deviations from the mean, and
- 95.1% or more of the data values must lie within ±3 standard deviations from the mean.

For bell-shaped distributions, based on the Empirical Rule, approximately

- 68% of the data values lie within ±1 standard deviation from the mean,
- 95% of the data values lie within ±2 standard deviations from the mean, and
- 99.7% of the data values lie within ±3 standard deviations from the mean.

Applications of Chebyshev's Theorem

A data set has a mean of 100 and a standard deviation of 10. Without knowing the distribution of the data set,

a) At least what percentage of the data values must be between 80 and 120?
80 is 2σ below the mean and 120 is 2σ above the mean. Hence, 80 to 120 is within ±2σ from the mean. By the theorem, at least 75% of the data values must be within this range.

b) At most what percentage of the data values are either less than 70 or greater than 130?
By the theorem, at least 89% of the data must be within ±3σ from the mean, i.e., between 70 and 130. Hence, at most 11% can be either greater than 130 or less than 70.

c) If the data set is unimodal and perfectly symmetrical (with respect to the mean), then what percentage of the data values must be between 80 and 120? Between 70 and 100? Between 100 and 110?
At least 88.9% are between 80 and 120 (within ±2σ). By symmetry, each half of an interval centered at the mean holds half of its values, so at least 47.55% (half of the 95.1% within ±3σ) are between 70 and 100, and at least 27.78% (half of the 55.56% within ±1σ) are between 100 and 110.
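The bounds used in parts a) to c) can be computed directly from the two formulas. The following is a minimal sketch; the function names are hypothetical, and the second function simply encodes the unimodal, symmetric bound as stated in these notes.

```python
def chebyshev_min_percent(k):
    """Chebyshev lower bound: at least 100 * (1 - 1/k**2) percent of the
    values lie within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

def unimodal_symmetric_min_percent(k):
    """Sharper bound from the notes for unimodal, symmetric data:
    at least 100 * (1 - (4/9) / k**2) percent within k standard deviations."""
    return 100 * (1 - (4 / 9) / k ** 2)

mu, sigma = 100, 10

# a) 80 to 120 is mu +/- 2 sigma
k_a = (120 - mu) / sigma
at_least_a = chebyshev_min_percent(k_a)

# b) "below 70 or above 130" is the complement of mu +/- 3 sigma
at_most_b = 100 - chebyshev_min_percent(3)

# c) symmetric case: half of each centered interval lies on each side of the mean
between_80_120 = unimodal_symmetric_min_percent(2)
between_70_100 = unimodal_symmetric_min_percent(3) / 2
between_100_110 = unimodal_symmetric_min_percent(1) / 2

for name, value in [("a", at_least_a), ("b", at_most_b),
                    ("c: 80-120", between_80_120),
                    ("c: 70-100", between_70_100),
                    ("c: 100-110", between_100_110)]:
    print(name, round(value, 1))
```

Because the percentages quoted in the text are rounded, the unrounded values computed here can differ from them in the second decimal place.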

Food for thought: Why do we use "the square root of the average squared distance from the mean" as a measure of dispersion, instead of simply using "the average difference from the mean"? In other words, why do you think statisticians define the standard deviation as

$$\sigma = \sqrt{\frac{1}{N}\sum_i (x_i - \mu)^2},$$

instead of simply as

$$\sigma = \frac{1}{N}\sum_i (x_i - \mu)\,?$$

Understanding Variation in Data

- The more spread out, or dispersed, the data are, the larger will be the range, the interquartile range, the variance, and the standard deviation.
- The more concentrated, or homogeneous, the data are, the smaller will be the range, the interquartile range, the variance, and the standard deviation.
- If the observations are all the same (so that there is no variation in the data), the range, the interquartile range, the variance, and the standard deviation will be zero.
- None of the measures of variation (the range, the interquartile range, the variance, and the standard deviation) can ever be negative.

Measure of Position – Z Score:

The Z score (the standard score) measures "the number of standard deviations that a given value x is above or below the mean".

$$Z = \frac{x - \mu}{\sigma} \quad \text{for a population}$$

$$Z = \frac{x - \bar{x}}{s} \quad \text{for a sample}$$

Example: For a data set with μ = 50 and σ = 4,
a) the value 52 has a Z score equal to 0.5, and hence, is 0.5 standard deviation above the mean.
b) the value 47 has a Z score equal to -0.75, and hence, is 0.75 standard deviation below the mean.

Comparing data values with Z score: A student took 2 exams, Statistics and Economics. The statistics of the class performance are as follows:

              His Score   Class Mean   Class Std. Dev.   Z
Statistics    70          60           2                 5
Economics     82          80           4                 0.5

In which exam do you think this student did better? Statistics, because he is 5 standard deviations above the mean in the first exam (which is really unusual) and only 0.5 standard deviation above the mean (which is just slightly above average) in the second exam.
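The Z scores in the examples above follow from one line of arithmetic each. As a small sketch (the function name `z_score` is chosen here for illustration):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Data set with mean 50 and standard deviation 4
print(z_score(52, 50, 4))
print(z_score(47, 50, 4))

# Exam comparison: (his score, class mean, class standard deviation)
exams = {"Statistics": (70, 60, 2), "Economics": (82, 80, 4)}
z_by_exam = {name: z_score(*values) for name, values in exams.items()}
best = max(z_by_exam, key=z_by_exam.get)
print(z_by_exam, "-> better relative performance in", best)
```

Comparing Z scores rather than raw scores is what makes the two exams comparable, since each score is measured against its own class mean and spread.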

MEASURE OF ASSOCIATION: COVARIANCE AND CORRELATION

Covariance is a measure of the linear relationship between two numerical variables. The sample covariance of X and Y is:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

It is easy to see from the above formula that, when two variables are positively related, their covariance will be positive. Similarly, when they are negatively related, their covariance will be negative. The limitation of covariance as a descriptive measure is that it is affected by the units in which the two variables are measured. As another measure of linear relationship between two variables, the correlation coefficient, defined below, is scale-independent.

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Stdev}(X) \times \mathrm{Stdev}(Y)}$$

The correlation coefficient is always between –1 and +1, with –1 indicating a perfect negative relationship and +1 indicating a perfect positive relationship. The correlation is 0 for two variables that have no linear relationship.

Example: Calculate the sample covariance and correlation of the following two sets of samples, X and Y:
X = {1, 3, 5, 4, 7}
Y = {3, 7, 6, 10, 9}

[Figure: scatter plot of X and Y.]

First, calculate the means: $\bar{X} = 4$ and $\bar{Y} = 7$. Then, using the formula given above, the sample covariance of X and Y is:

X     Y     (X – X̄)   (Y – Ȳ)   (X – X̄)(Y – Ȳ)
1     3     -3         -4         12
3     7     -1          0          0
5     6      1         -1         -1
4     10     0          3          0
7     9      3          2          6
                        Sum        17

$$\mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} = \frac{17}{4} = 4.25$$

You should also verify that the sample standard deviations are $s_X = 2.2361$ and $s_Y = 2.7386$. Hence, using the formula given above, the sample correlation coefficient of X and Y is: Corr(X, Y) = 4.25/(2.2361 × 2.7386) = 0.694.

Question for discussion: Correlation measures only the strength of the linear relationship between a pair of variables. It cannot measure a non-linear relationship. It is possible for a pair of variables to have zero correlation even if they have a perfect non-linear relationship (i.e., given the value of one variable, the value of the other variable can be predicted perfectly). Can you think of a numerical example in which such a case arises?
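The covariance and correlation worked example above can be verified with a short Python sketch. The function `sample_covariance` is a hypothetical helper that implements the n - 1 formula from this section; the standard deviations come from `statistics.stdev`, which also divides by n - 1.

```python
from statistics import mean, stdev

X = [1, 3, 5, 4, 7]
Y = [3, 7, 6, 10, 9]

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

cov_xy = sample_covariance(X, Y)
s_x, s_y = stdev(X), stdev(Y)
corr_xy = cov_xy / (s_x * s_y)

print("Cov(X, Y) =", cov_xy)
print("s_X =", round(s_x, 4), " s_Y =", round(s_y, 4))
print("Corr(X, Y) =", round(corr_xy, 3))
```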

OBTAINING SUMMARY STATISTICS WITH EXCEL

Excel Functions: Many Excel functions calculate sample statistics. Frequently used ones include: AVERAGE, STDEV (for sample standard deviation), STDEVP (for population standard deviation), VAR (sample variance), VARP (population variance), COUNT, MAX, MIN, MEDIAN, MODE, CORREL, and COVAR. Note that COVAR divides by n rather than n - 1, so it returns the population covariance, not the sample covariance defined above. To find out how to use these functions, select Insert/Function (or click on the function wizard). Then, select the Statistical category to see the list of statistical functions available. Choose the function that you want to use and then simply follow the instructions to complete all the steps.
