Topic 3 - Measures of central tendency and dispersion extra notes PDF

Title Topic 3 - Measures of central tendency and dispersion extra notes
Course Economics in Action
Institution Queen Mary University of London
Pages 10

Summary

Lecturer: Guglielmo Volpe
Extra notes for topic 3...


Description

School of Economics and Finance
ECN125 Economics in Action
Topic 3 – Measures of Central Tendency and Dispersion

We look at some key statistics that are often used in the analysis of data. We will focus our attention on measures that define the ‘centre’ and the ‘dispersion’ of either the population or a sample of data. These statistics are important in defining the key characteristics of a dataset and play a key role in the inferential statistical analysis that we will discuss later in the semester.

1. Measures of Central Tendency: in describing the shape of the distribution of a sample or population of measurements, we are also interested in describing the data set’s central tendency. A measure of central tendency represents the centre or middle of the data



Population and Sample: from lecture 1 you will remember that when you collect data you may observe either a whole population or a sample from the population. Numerical measures calculated from the data are known as parameters (for a population) or statistics (for a sample)

Statistic: a numerical descriptor that is calculated from sample data and is used to describe the sample. Statistics are usually represented by Roman letters



Parameter: a numerical descriptor that is used to describe a population. Parameters are usually represented by Greek letters

The Mean: you are probably already familiar with this measure of centre. The mean, or average (as it is commonly referred to), is calculated by adding all of the data values and then dividing by the number of values.

The Sample Mean: is the centre of balance of a set of data and is found by adding up all of the data values and dividing by the number of observations. Usually the sample mean is denoted by a Roman letter (usually X or x) with a bar on top, $\bar{x}$:

$$\bar{x} = \frac{\text{Sum of all the values in the sample}}{\text{Total number of observations}}$$

The Population Mean: this is the average of the population measurements and it is denoted by the Greek letter $\mu$ (pronounced "mew"). The population mean is computed in the same way as the sample mean. The only difference between the two is that the population mean is computed over all population values while the sample mean is computed over a sample of the population

Example: The following table shows the 2004 GDP per capita (in $) for a sample of 10 EU countries:

Country      GDP per capita ($)
Bulgaria     8,620
Belgium      25,885
Portugal     17,400
France       26,168
Lithuania    12,382
Greece       16,488
Italy        23,175
Norway       34,759
Poland       9,704
Spain        20,976

Question: What is the sample average for these 10 countries?

Answer: the sample average GDP per capita is computed by summing up the 10 values and dividing the sum by the number of countries (the sample average is denoted with the letter Y since this is the letter commonly used to denote income or GDP in economics):

$$\bar{Y} = \frac{8620 + 25885 + 17400 + 26168 + 12382 + 16488 + 23175 + 34759 + 9704 + 20976}{10} = 19555.7$$

The sample average GDP per capita in these 10 EU countries is $19,555.7
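The calculation above can be checked with a few lines of Python. This is only a sketch that is not part of the original notes; the variable names are illustrative.

```python
# GDP per capita (in $) for the 10 countries in the example.
gdp_per_capita = {
    "Bulgaria": 8620, "Belgium": 25885, "Portugal": 17400, "France": 26168,
    "Lithuania": 12382, "Greece": 16488, "Italy": 23175, "Norway": 34759,
    "Poland": 9704, "Spain": 20976,
}

values = list(gdp_per_capita.values())

# Sample mean: sum of the observations divided by the number of observations.
sample_mean = sum(values) / len(values)
print(f"Sample mean: {sample_mean:.1f}")
```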



The Mathematics of the Sample Mean: in order to write a formula for the sample mean, we employ the letter n to represent the number of sample measurements and we refer to n as the sample size. Furthermore we denote the sample measurements as $x_1, x_2, x_3, \dots, x_n$. Here $x_1$ is the first sample measurement, $x_2$ is the second sample measurement and so forth. We denote the last sample measurement as $x_n$. Moreover, when we write formulas we often use summation notation for convenience. For instance we write the sum of the sample measurements by using the Greek letter sigma $\Sigma$ (the index i=1 to n says that we let the subscript i in the general term $x_i$ range from 1 to n and we add up all these terms):

$$x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$$

So, given this notation the mathematical formula for the sample mean is given by:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

The Median: this is another measure of central tendency of a population or a sample which provides a measure of the middle of the data set after it is sorted from lowest to highest 

Some Problems with the Sample Mean: the value of the sample mean is influenced by unusually high or low values in the sample and might not present a true picture of the sample data. For this reason we often look at other measures of centre in addition to the sample mean



Sample Median: it is the value of the middle observation in an ordered set of data. Finding the sample median requires sorting the data set first. Once this is done, the sample median is the value of the observation that is in the middle of the data



Location of the Middle: the exact location of the middle depends on whether the number of observations in the sample is even or odd

o Odd Number of Observations: in this case the median is the value of the observation in the (n+1)/2 position

o Even Number of Observations: in this case the median is the average of the values of the observations in the n/2 and n/2+1 positions

Example: Let’s refer to the example we used above. To compute the sample median GDP per capita we first need to sort the data from the smallest GDP per capita to the largest:

Country      GDP per capita ($)
Bulgaria     8,620
Poland       9,704
Lithuania    12,382
Greece       16,488
Portugal     17,400
Spain        20,976
Italy        23,175
Belgium      25,885
France       26,168
Norway       34,759

Given that the number of observations is even (10), the sample median is given by the average of the GDP per capita values in positions 10/2=5 and 10/2+1=6. Portugal’s and Spain’s GDP per capita are, respectively, the fifth and sixth values in the sorted table. Thus, the sample median is given by (17,400+20,976)/2=19,188.

As you can see, the sample median is similar but not identical to the sample mean.
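The odd/even rule for locating the median can be sketched in Python as follows. The function name `sample_median` is illustrative and not from the notes; note that Python lists are indexed from 0, so position p in the text corresponds to index p-1.

```python
def sample_median(data):
    """Median of a sample using the odd/even position rule."""
    ordered = sorted(data)          # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the observation in position (n+1)/2, i.e. index n//2.
        return ordered[mid]
    # Even n: average of the observations in positions n/2 and n/2+1.
    return (ordered[mid - 1] + ordered[mid]) / 2


gdp = [8620, 25885, 17400, 26168, 12382, 16488, 23175, 34759, 9704, 20976]
median_gdp = sample_median(gdp)
print(median_gdp)
```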

The Mode: this is another measure that is used in statistics to describe the centre of the data of a sample or population

o The Sample Mode: it is the data value that has the highest frequency of occurrence in the sample

o The Modal Class: for continuous data, where there are many different possible values, we do not usually talk about the mode because in some cases there might not be any repeated values in the sample. In this case we make reference to the so-called modal class, which is the class interval in a frequency distribution or histogram that has the highest frequency

Example: A large retailer of women’s clothing is trying to obtain some information that will help it formulate an ordering policy for clothing size. The retailer decides to look at a single line of apparel and collect data on the sizes of the items sold in a 2-week period. A frequency table of the data shows:

Size    Frequency
6       1
8       2
10      12
12      10
14      4
16      1

Question: what is the sample mode?
Answer: the size with the largest frequency is size 10, with 12 observations. Hence, size 10 is the sample mode.

Question: can you work out the sample mean and the sample median for this problem? [Answer: sample mean: 11.1; sample median: 11]
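As a rough check of the answers above, the frequency table can be expanded into the 30 individual observations and summarised with Python's standard `statistics` module. This is a sketch; the variable names are illustrative.

```python
import statistics

# Frequency table from the clothing-size example: size -> number of items sold.
size_frequency = {6: 1, 8: 2, 10: 12, 12: 10, 14: 4, 16: 1}

# Expand the table into one entry per item sold.
observations = [size for size, freq in size_frequency.items() for _ in range(freq)]

# Mode: the size with the highest frequency.
mode_size = max(size_frequency, key=size_frequency.get)
mean_size = statistics.mean(observations)
median_size = statistics.median(observations)

print(mode_size, round(mean_size, 1), median_size)
```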

Bimodal and Multimodal Samples: sometimes the highest frequency occurs at two or more different measurements. When this happens, two or more modes exist. When exactly two modes exist, we say the data are bimodal. When more than two modes exist, we say the data are multimodal

Comparing the Mean, Median and Mode (see graph in appendix 1)

o When the frequency distribution of a sample measurement is approximately symmetrical, the sample mean, median and mode will be nearly the same

o When the frequency distribution is skewed to the right, the sample mean is larger than the sample median, and the sample median is larger than the sample mode

o When the distribution is skewed to the left, the sample mean is smaller than the sample median, and the sample median is smaller than the sample mode





Weighted Means: in order to calculate the mean we sum the population (or sample) measurements and then divide this sum by the number of measurements in the population (or sample). When we do this each measurement counts equally or, in other words, each measurement is given the same importance or weight. However, in some cases it would make sense to give different measurements unequal weights  

Example: for an example see Appendix 2 at the end of these notes

Formula to Compute the Weighted Mean: the weighted mean is computed by summing the products of each measurement and its chosen weight, and dividing this sum by the sum of the chosen weights:

$$\text{weighted mean} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where:
$x_i$ = the value of the ith measurement
$w_i$ = the weight applied to the ith measurement
n = the sample or population size
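The formula can be illustrated in Python with the regional figures listed in Appendix 2. Treating the number of economically active individuals as the weight for each region's unemployment rate is an assumption made here for illustration, since the appendix answer is cut off in these notes.

```python
# Figures from Appendix 2: (region, economically active, unemployment rate).
regions = [
    ("North East", 1240, 6.3), ("North West", 3379, 5.9),
    ("Yorkshire", 2578, 5.7), ("East Midland", 2263, 5.3),
    ("West Midland", 2660, 6.3), ("East", 2921, 4.7), ("London", 3939, 6.8),
]

# Assumption: weights w_i are the economically active counts,
# measurements x_i are the regional unemployment rates.
weights = [active for _, active, _ in regions]
rates = [rate for _, _, rate in regions]

weighted_mean = sum(w * x for w, x in zip(weights, rates)) / sum(weights)
simple_mean = sum(rates) / len(rates)

print(f"weighted: {weighted_mean:.2f}, unweighted: {simple_mean:.2f}")
```

The two results differ because large regions such as London pull the weighted figure towards their own rate, while the simple mean treats every region equally.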

Descriptive Statistics for Grouped Data: often the only data available are in the form of a frequency distribution or a histogram. Data in this form are usually referred to as grouped data (please refer to the frequency tables in lecture 2!). The question we should ask ourselves is: how can we compute the descriptive statistics (mean and variance) for such data?

Example: for an example see Appendix 3 at the end of these notes



Formulae to compute the Sample Mean for grouped data: the general formula to compute the sample mean for grouped data is given by:

$$\bar{x} = \frac{\sum_{i} f_i M_i}{\sum_{i} f_i} = \frac{\sum_{i} f_i M_i}{n}$$

where the sums run over the classes and:
$f_i$ = the frequency of class i
$M_i$ = the midpoint of class i
n = the sample size (the sum of the class frequencies)
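A minimal Python sketch of the grouped-data mean follows. The class intervals and frequencies are hypothetical values chosen for illustration; they do not come from Appendix 3.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency) per class.
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 7)]

# Midpoint M_i of each class and its frequency f_i.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [freq for _, _, freq in classes]

n = sum(frequencies)  # sample size = sum of the class frequencies
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

print(grouped_mean)
```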

2. Measures of Dispersion or Spread: generally, simply describing the centre of data or a typical data value does not provide complete information about the data set. In addition to knowing what a typical value for the sample is, it is important to know how diverse the values in the sample can be. That is, we need to know how spread out or dispersed the data values are relative to the typical values 

The Range: consider a population or a sample of measurements. The range of the measurements is the largest measurement minus the smallest measurement

Example: Consider our first example concerning the GDP per capita in a sample of 10 EU countries. In this case the range is given by the difference between the largest GDP per capita (Norway) and the smallest one (Bulgaria): R = 34,759 – 8,620 = 26,139

o Limitation of the Range: the range is heavily influenced by unusual or extreme values in the sample. When you calculate the range you use only the two extreme measurements, so when these are unusual they have a large impact on the statistic. As a rule, when the sample size is more than 25 the sample range should not be used as a measure of variability
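The range for the GDP sample can be reproduced with a short Python sketch (variable names are illustrative):

```python
gdp = [8620, 25885, 17400, 26168, 12382, 16488, 23175, 34759, 9704, 20976]

# Range: largest measurement minus smallest measurement.
sample_range = max(gdp) - min(gdp)
print(sample_range)
```

Because only `max(gdp)` and `min(gdp)` enter the calculation, changing any of the other eight values leaves the range unchanged, which is the limitation described above.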



Population Variance $\sigma^2$ (pronounced sigma squared): is the average of the squared deviations of the individual population measurements from the population mean $\mu$

Population Standard Deviation $\sigma$ (pronounced sigma): is the positive square root of the population variance

Example: Consider the population of profit margins for five of the best big companies in America as rated by Forbes magazine in 2005. The profit margins are: 8%, 10%, 15%, 12% and 5%.

Question: what are the population variance and the population standard deviation?

Answer: to answer these questions we first need to compute the population mean, which is given by:

$$\mu = \frac{8 + 10 + 15 + 12 + 5}{5} = \frac{50}{5} = 10$$

Secondly, we need to calculate the deviations of the individual population measurements from the population mean $\mu = 10$ as follows: (8-10)=-2, (10-10)=0, (15-10)=5, (12-10)=2, (5-10)=-5

Then we compute the sum of the squares of these deviations:

$$(-2)^2 + (0)^2 + (5)^2 + (2)^2 + (-5)^2 = 58$$

Finally we calculate the population variance $\sigma^2$, the average of the squared deviations, by dividing the sum of the squared deviations, 58, by the number of squared deviations, 5:

$$\sigma^2 = \frac{58}{5} = 11.6$$

The population standard deviation is then given by the square root of $\sigma^2$:

$$\sigma = \sqrt{\frac{58}{5}} = 3.406$$

Interpretation of Variance and Standard Deviation: this tells us that although the average profit margin is 10%, the actual profit margins vary around that value. Typically, the differences from the average will be about 3.4 percentage points.
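The steps of the profit-margin example can be traced in Python. This is a sketch only, with illustrative variable names.

```python
import math

# Population of profit margins (in %) from the example.
margins = [8, 10, 15, 12, 5]
N = len(margins)

mu = sum(margins) / N                         # population mean
deviations = [x - mu for x in margins]        # deviations from the mean
sum_sq = sum(d ** 2 for d in deviations)      # sum of squared deviations

pop_variance = sum_sq / N                     # average squared deviation
pop_std = math.sqrt(pop_variance)             # positive square root

print(mu, sum_sq, pop_variance, round(pop_std, 3))
```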



Sample Variance and Sample Standard Deviation (the maths!): when a population is too large to measure all the population units, we estimate the population variance and the population standard deviation by the sample variance (denoted by $s^2$) and the sample standard deviation (denoted simply by $s$). The mathematical expressions for the computation of the two statistics are:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \qquad \text{(sample variance)}$$

$$s = \sqrt{s^2} \qquad \text{(sample standard deviation)}$$

A simple way to compute the sample variance: the variance can actually be computed by using a formula that considerably reduces the number of calculations you have to perform:

$$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$$
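To see that the definitional formula and the shortcut formula give the same answer, both can be evaluated in Python on the GDP sample from Section 1 and compared with the standard library's `statistics.variance`, which also uses the n-1 divisor. This is a sketch with illustrative variable names.

```python
import math
import statistics

gdp = [8620, 25885, 17400, 26168, 12382, 16488, 23175, 34759, 9704, 20976]
n = len(gdp)
x_bar = sum(gdp) / n

# Definitional formula: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in gdp) / (n - 1)

# Shortcut formula: only the sum and the sum of squares are needed.
s2_shortcut = (n * sum(x ** 2 for x in gdp) - sum(gdp) ** 2) / (n * (n - 1))

s = math.sqrt(s2_definition)  # sample standard deviation

print(round(s2_definition, 2), round(s2_shortcut, 2), round(s, 2))
```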



Example: see appendix 4 for an example on how to compute the sample variance and standard deviation



Coefficient of Variation (CV): to compare dispersion in data sets with dissimilar units of measurement (e.g., kilograms and ounces) or dissimilar means (e.g., home prices in two different cities) we define the coefficient of variation (CV), which is a unit-free measure of dispersion:

$$CV = 100 \times \frac{s}{\bar{x}}$$

o From the formula we can see that the CV is the standard deviation expressed as a percentage of the mean

o The main weakness of the CV is that it is undefined if the mean is zero or negative, so it is appropriate only for positive data
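A short Python sketch of the CV, applied to the GDP sample from Section 1. The function name and the guard against a non-positive mean are illustrative choices, not part of the notes.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    x_bar = statistics.mean(data)
    if x_bar <= 0:
        # The CV is only appropriate for data with a positive mean.
        raise ValueError("CV requires a positive mean")
    return 100 * statistics.stdev(data) / x_bar


gdp = [8620, 25885, 17400, 26168, 12382, 16488, 23175, 34759, 9704, 20976]
cv_gdp = coefficient_of_variation(gdp)
print(f"CV = {cv_gdp:.1f}%")
```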



Mean Absolute Deviation (MAD): this is an additional measure of dispersion that reveals the average distance from the centre:

$$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
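As an illustration, the MAD of the five profit margins used in the variance example can be computed in Python (a sketch; variable names are illustrative):

```python
margins = [8, 10, 15, 12, 5]
n = len(margins)
x_bar = sum(margins) / n

# Average of the absolute distances from the mean.
mad = sum(abs(x - x_bar) for x in margins) / n
print(mad)
```

The absolute deviations are 2, 0, 5, 2 and 5, so their average is 14/5 = 2.8, somewhat smaller than the standard deviation of about 3.4 because large deviations are not squared.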

 How can we compute the variance and the standard deviation if the data is organised as grouped data?

o Formulae to compute the Sample Variance for grouped data: the general formula to compute the sample variance for grouped data is given by:

$$\operatorname{var}(x) = \frac{\sum_{i} f_i (M_i - \bar{x})^2}{n - 1}$$

where the sum runs over the classes and:
$f_i$ = the frequency of class i
$M_i$ = the midpoint of class i
n = the sample size (the sum of the class frequencies)
$\bar{x}$ = the sample mean for the grouped data
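The grouped-data variance can be sketched in Python as follows. The class midpoints and frequencies are hypothetical values chosen for illustration, not data from the notes.

```python
import math

# Hypothetical grouped data: class midpoints M_i and frequencies f_i.
midpoints = [5, 15, 25]
frequencies = [5, 8, 7]

n = sum(frequencies)  # sample size

# Grouped sample mean, needed before the variance can be computed.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Frequency-weighted squared deviations of midpoints from the grouped mean.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)
grouped_std = math.sqrt(grouped_variance)

print(grouped_mean, round(grouped_variance, 3), round(grouped_std, 3))
```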

What is our interpretation of the Standard Deviation? – The Empirical Rule: the standard deviation is not as intuitive or as appealing as the sample range. One way to understand what information the standard deviation gives is to use the empirical rule

The Empirical Rule: suppose that we have a population with mean $\mu$ and standard deviation $\sigma$ that is described by a symmetric, bell-shaped distribution (a distribution with these characteristics is usually referred to as a normal distribution); in this case the empirical rule states:

o About 68% of all observations are within one standard deviation of the mean

o About 95% of all observations are within two standard deviations of the mean

o Almost all (more than 99%) of the observations are within three standard deviations of the mean

Tolerance Interval: in general, the interval that contains a specified percentage of the individual measurements in a population is called a tolerance interval. This means that the one, two and three standard deviation intervals around $\mu$, that is $[\mu \pm \sigma]$, $[\mu \pm 2\sigma]$ and $[\mu \pm 3\sigma]$, are tolerance intervals containing respectively about 68%, 95% and more than 99% of the measurements in a normally distributed population

o Notice!: of course we usually do not know the true values of $\mu$ and $\sigma$. Therefore we must estimate the tolerance interval by replacing $\mu$ and $\sigma$ in these intervals by the mean $\bar{x}$ and standard deviation $s$ of a sample that has been randomly selected from the normally distributed population. However, we need to make sure that the sample size is large enough to guarantee that they are good estimates!

Example – See appendix 5 for an example!

Is the Empirical Rule still valid if the Distribution is not Normal? If we are dealing with a set of observations whose distribution is not mound-shaped or normal, the empirical rule cannot be used. In this case we can consider using the so-called Chebyshev’s Theorem to find an interval that contains a specified percentage of the individual measurements in the population

Chebyshev’s Theorem: consider any population that has a mean $\mu$ and standard deviation $\sigma$. Then, for any value of k greater than 1, at least $100(1 - 1/k^2)\%$ of the population measurements lie in the interval $[\mu \pm k\sigma]$

Example: if we choose k=2, then at least $100(1 - 1/2^2)\% = 100(3/4)\% = 75\%$ of the population measurements lie in the interval $[\mu \pm 2\sigma]$. As another example, if we choose k equal to 3, then at least $100(1 - 1/3^2)\% = 100(8/9)\% = 88.89\%$ of the population measurements lie in the interval $[\mu \pm 3\sigma]$
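The Chebyshev lower bound is easy to tabulate for other values of k. The following Python sketch uses an illustrative function name, `chebyshev_lower_bound`, which is not from the notes.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem is stated for k greater than 1")
    return 100 * (1 - 1 / k ** 2)


for k in (1.5, 2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}%")
```

These bounds hold for any distribution, which is why they are lower than the 95% and 99% figures that the empirical rule gives for normally distributed data.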

z-scores: we can use the empirical rule to determine the relative location of any value in a population or sample by using the mean and standard deviation to compute the value’s z-score. In other words, the z-score measures the number of standard deviations that a data value is from the mean. For any value of x in a population or sample, the z-score corresponding to x is defined as follows:

$$z = \frac{x - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$

As in the empirical rule, for sample data we substitute $\bar{x}$ and $s$ for $\mu$ and $\sigma$, respectively



The z-score is also called the standardised value, and is the number of standard deviations that x is from the mean. A positive z-score indicates that the data value is above the mean, whereas a negative z-score indicates that the data value is below the mean
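As an illustration, z-scores for the population of five profit margins from the variance example (mean 10, variance 11.6) can be computed in Python. This is a sketch; the function name `z_score` is illustrative.

```python
import math


def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


margins = [8, 10, 15, 12, 5]
mu = sum(margins) / len(margins)
sigma = math.sqrt(sum((x - mu) ** 2 for x in margins) / len(margins))

z_scores = [z_score(x, mu, sigma) for x in margins]
for x, z in zip(margins, z_scores):
    print(f"margin {x}%: z = {z:+.2f}")
```

The 15% margin gets a positive z-score (above the mean), the 5% margin a negative one of the same size, and the 10% margin a z-score of zero.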

Readings

Newbold P., Carlson W., Thorne B., (2012), Statistics for Business and Economics, 8th ed., Pearson, chapter 2
Doane D., Seward L., (2013), Applied Statistics in Business and Economics, 4th ed., chapter 4, sections 4.1–4.4, 4.7
Lind D., Marchal W., Wathen S., (2011), Statistical Techniques in Business and Economics, 15th ed., chapter 3
Bowerman B.L., O’Connell, Murphree E., (2011), Business Statistics in Practice, 6th ed., chapter 2, sections 2.2, 2.3, 2.4, 2.8
Nieuwenhuis G., (2009), Statistical Methods for Business and Economics, chapter 3
Pelosi M.K., Sandifer T.M., (2002), Doing Statistics for Business in Excel, 2nd ed., chapter 4
Anderson D.R., Sweeney D.J., Williams T.A., Freeman J., Shoesmith E., (2007), Statistics for Business and Economics, chapter 3, sections 3.1, 3.2, 3.3, 3.4, 3.6

Appendix 1 – Comparing Mean, Median and Mode

Appendix 2 – Weighted Mean

Example: The following table reports statistics for the labour force (economically active individuals) and the unemployment rate for a set of regions in England.

Region          Economically Active    Unemployment rate
North East      1240                   6.3
North West      3379                   5.9
Yorkshire       2578                   5.7
East Midland    2263                   5.3
West Midland    2660                   6.3
East            2921                   4.7
London          3939                   6.8

Question: what is the unemployment rate in England? Answer: if we wish to compute the mean unemployment rate for England we should use a ...

