Topic 4 Part 1 Skewed Distributions PDF

Title Topic 4 Part 1 Skewed Distributions
Author reg filangee
Course Quantitative Methods
Institution Dawson College
Pages 24


TOPIC 4: The “Look of Average”: Skewed Distributions

Key Terms:
Skewed Means
Skewed Distributions
Outlier/ Outliers
Frequency Polygon
Baseline (horizontal axis)
Ordinate (vertical axis)
Peak (highest point of frequency polygon’s curve, where mode is located)
Tail (part of frequency polygon’s curve that stretches most)
Perfectly Symmetrical Distribution/ Perfectly Symmetrical Curve
Perfectly Bell-Shaped Curve
Symmetrical Distribution/ Symmetrical Curve
Positively Skewed Distribution/ Positively Skewed Curve
Negatively Skewed Distribution/ Negatively Skewed Curve
____________

This topic deals with what is meant by skewed means and skewed distributions as well as the significance of these terms. It also deals with what I like to call the “look of average”, or the shape of a curve when rankable variables, like numeric variables, are graphed. These are more complex subjects than those previously covered in the course, but they should be fairly easy to grasp with practice and if the concepts of mean and median (topic 3) have been properly understood.

Since this topic requires the calculation of the mean, it should go without saying by now that it only applies to NUMERIC variables because the mean (or mathematical average) cannot be found on nominal or ordinal data.
__________________


Recall that in the class just before the mid-term break, the term skewed mean was introduced to describe a mathematical average that is distorted. The word skewed, in fact, generally indicates something distorted. The example of six people’s annual incomes was provided to show how the mathematical average could be distorted, or skewed, by a value that is too big in relation to the others.

Annual Incomes, n = 6
  $   5,000
  $   5,000
  $  30,000
  $  40,000
  $  50,000
  $ 400,000
  _________
Σ $ 530,000

$530,000 / 6 = $88,333.33

When the total of these 6 incomes is divided by 6 people, the mean obtained is $88,333.33, an average income that is much larger than that of 5 of these 6 people. This mathematical average is obviously being distorted, or skewed, by the person earning $400,000. This $400,000 income would be called an outlier, in this case. It is creating a skewed mean or skewing this distribution. Numeric data can have one outlier or several outliers. An outlier can be a number that is much larger or, by contrast, much smaller than the others, although outliers that are much larger are usually the ones that create noticeable mean distortions.

Correcting Skewed Means

Recall, as well, that one way to correct such a skewed distribution in order to obtain an average that better reflects the reality of the sample is to eliminate the outlier and recalculate the mean. In the case of the example above, if the $400,000 income is eliminated the total would equal $130,000. Dividing $130,000 by the now 5 people equals $26,000, a mean which is much closer to most of these people’s annual income.

Another way to correct a skewed mean is to keep the outlier but to use the median instead as a type of average. The median is the exact midpoint of data that can be ranked.

Annual Incomes, n = 6
  $   5,000
  $   5,000
  $  30,000
  $  40,000
  $  50,000
  $ 400,000

Since there are 6 incomes in this sample (6 is an even number so there must be two middle people), the incomes of the 3rd and 4th ranked persons are the exact middle (there are 2 people below the 3rd person and 2 people above the 4th). Because there are two middle incomes, they must be added and divided by 2. So ($30,000 + $40,000) / 2 = $70,000 / 2 = $35,000. This median income of $35,000 is also an income that better reflects “on average” what most of these 6 people earn. In other words, this median income of $35,000 is more realistic for most of the 6 people in the sample than the actual mean income of $88,333.33, which is very much a skewed mean caused by the single person earning $400,000.
_______________
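The three figures worked out above (the skewed mean, the mean with the outlier removed, and the median) can be checked with a short Python sketch. It uses only the standard statistics module; the variable names are illustrative and are not part of the course material.

```python
from statistics import mean, median

# Annual incomes of the six people in the example
incomes = [5_000, 5_000, 30_000, 40_000, 50_000, 400_000]

# Mean with the outlier included: 530,000 / 6
skewed_mean = mean(incomes)

# First correction: drop the $400,000 outlier and recalculate the mean
without_outlier = [x for x in incomes if x != 400_000]
corrected_mean = mean(without_outlier)      # 130,000 / 5

# Second correction: keep the outlier but use the median instead
median_income = median(incomes)             # (30,000 + 40,000) / 2

print(round(skewed_mean, 2), corrected_mean, median_income)
```

Both corrections land far closer to what most of the six people earn than the skewed mean does.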

Skewed means are very common

It must be emphasized that most numeric distributions have a mean that is skewed. The skew may be very large because of outliers. Or the skew may be very small and not created by outliers at all. Many distributions do not have outliers and yet their mean is usually skewed to some degree. Please keep this in mind. What creates any kind of skew, whether large or small, is simply the following:

IF THE MEAN IS NOT THE SAME AS THE MEDIAN, THEN THE DISTRIBUTION IS SKEWED, or we can also say the mean is skewed.

It does not matter if the skew is large or small; it is still a skew. To illustrate this, consider the following example.

Sleep Hours per Night, n=300

Hours        f
3            5
4           20
5           50
6          100
7           80
8           30
9           10
10           5
______________
Σ          300

Calculate the mean and median of this distribution. Correctly round the mean to one decimal place. Try doing this without looking at the answers which appear on the following page.

Sleep Hours per Night, n=300

Hours        f      X*f
3            5       15
4           20       80
5           50      250
6          100      600
7           80      560
8           30      240
9           10       90
10           5       50
_______________________
Σ          300     1885

The mean is 6.3 hours (1885 / 300 = 6.28, which rounds to 6.3).

The median is 6 hours. (We are looking for the middle of 300 people: 300 / 2 = 150, and since 300 is an even number there will be 2 middle people, persons 150 and 151. Whether counting down or up the frequency column, both persons 150 and 151 will be found in the 6 hours category.)

The mean and median of this distribution are not the same, hence this is a skewed distribution. We can also say that the mean is skewed in this case.
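The same calculation can be sketched in Python straight from the frequency table. This is only an illustration: storing the table as a dictionary and expanding it into one value per person are my own choices, made so that the standard statistics.median function can pick out persons 150 and 151.

```python
from statistics import median

# Frequency table: sleep hours -> number of people (f)
sleep = {3: 5, 4: 20, 5: 50, 6: 100, 7: 80, 8: 30, 9: 10, 10: 5}

n = sum(sleep.values())                         # total of the f column
total = sum(x * f for x, f in sleep.items())    # total of the X*f column
mean_hours = total / n

# One entry per person, ranked from fewest to most hours of sleep
ranked = [x for x, f in sorted(sleep.items()) for _ in range(f)]
median_hours = median(ranked)                   # average of persons 150 and 151

# The rule from the notes: if the mean is not the median, the distribution is skewed
is_skewed = mean_hours != median_hours

print(n, total, round(mean_hours, 1), median_hours, is_skewed)
```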

If this data is graphed, it would be appropriate to use a frequency polygon, what is sometimes called a line graph, although frequency polygon is the more correct term for this kind of graph. Please see what this graph would look like on the next page.

[Figure: frequency polygon of "Sleep Hours per Night, n=300". Baseline: Hours (3 to 10). Ordinate: f (0 to 120).]

A frequency polygon always arranges the categories of the variable, in this case sleep hours, on the horizontal axis (or x axis), while the frequencies are always arranged on the vertical axis (or y axis). The name for a frequency polygon’s horizontal axis is the baseline, while the name of a frequency polygon’s vertical axis is the ordinate (because the number of people in each variable category is ordered here).


The look of this curve is almost perfectly bell-shaped but not quite. This indicates some kind of skew, which does exist since the mean (6.3 hours) is not the same as the median (6 hours).

NOTE: It is often expected that numeric distributions will produce something that almost looks like a bell curve because when numeric variables like sleep hours per night are measured on people, we expect fewer people to be found in the lowest or highest sleep hour categories while we expect more people to be found in the middle sleep hour categories. To put it more simply, the expectation is that people will tend to “bunch up”, or centralize, in the middle of numeric variable categories while fewer people will tend to be in the low or high categories of such variables. That is what I mean by the “look of average”. As another example, imagine that a sample of people is asked their weight (a numeric variable) and this would then be graphed with a frequency polygon. We would expect something like a bell curve with few people at extreme low or extreme high weights and most people somewhere in between these.

Determining the TYPE OF SKEW and CURVE of NUMERIC VARIABLES

There are different types of non-skewed and skewed distributions, which in turn produce different types of curves. The type of skew that a distribution has is important because, as will be seen, it helps us to assess other characteristics of a sample.

1) Perfectly Symmetrical Distribution producing the Perfectly Symmetrical Curve

The perfectly symmetrical distribution is extremely rare in real life and is defined as a distribution whose measures of central tendency are all exactly the same, or whose mode, median and mean are all equal, down to the last decimal.

Assume, for instance, that 9 college students were asked to indicate, on a scale of 1 to 5, their stress level before a test, with 1 representing not stressed at all and 5 representing extremely stressed. The results were as follows.

Pre-Test Stress, n=9

Stress Level     f
1                1
2                2
3                3
4                2
5                1
__________________
Σ                9

Calculate the mode, median and mean on your own before verifying the answers on the next page.

Pre-Test Stress, n=9

Stress Level     f     X*f
1                1       1
2                2       4
3                3       9
4                2       8
5                1       5
__________________________
Σ                9      27

The measures of central tendency are all exactly the same here.

Mode = stress level of 3
Median = stress level of 3 (looking for the middle of 9 people; only one middle person is needed because 9 is an odd number; 9/2 = 4.5, hence we need the 5th person, who is in the stress level 3 category)
Mean = stress level of 3 (Σ X*f = 27, 27/9 = 3)
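These three answers can be double-checked with a few lines of Python. This is only a sketch using the standard statistics module; the dictionary name and the step of expanding the table into one value per student are illustrative choices, not part of the course method.

```python
from statistics import mean, median, multimode

# Frequency table: stress level -> number of students (f)
stress = {1: 1, 2: 2, 3: 3, 4: 2, 5: 1}

# One entry per student, ranked from lowest to highest stress level
ranked = [level for level, f in sorted(stress.items()) for _ in range(f)]

modes = multimode(ranked)    # every value tied for the highest frequency
middle = median(ranked)      # the 5th of the 9 ranked students
average = mean(ranked)       # 27 / 9

# Perfectly symmetrical: a single mode that equals both the median and the mean
perfectly_symmetrical = len(modes) == 1 and modes[0] == middle == average

print(modes, middle, average, perfectly_symmetrical)
```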


Because all three measures of central tendency are exactly the same, the mean obviously equals the median, and WHEN THE MEAN IS EQUAL TO THE MEDIAN, THERE IS NO SKEW.

Note: when a distribution does not have a skew, the mean is playing the role of the median (because the two are the same) and that indicates that, like the median, which always divides the sample in half, the mean of a non-skewed distribution also divides the sample in half.

For example, if these same 9 students are not presented in a table but simply displayed in a row and ranked by stress level, the data would look like this:

1   2   2   3   3   3   4   4   5

The fifth student who divides the sample in half (since he or she has 4 students on either side) represents the median stress level of 3. However, this median level of 3 also represents the mean stress level of 3 and therefore the mean also divides the sample in half. This creates a SYMMETRY (or sameness) in the distribution in the sense that the same number of people (four in this case) will be found on either side of the mean of 3 because the mean, like the median (which is also 3), is splitting the sample in half. Again, when the mean is the same as the median (or the mean also divides the sample in half) it indicates that the distribution does not have a skew, a situation rarely seen in reality. ***But in this particular case, the MODE is also exactly the same as the median and the mean and this, in turn, creates a perfectly symmetrical distribution, which is extremely rare in reality. When a perfectly symmetrical distribution is graphed on a frequency polygon, it will always create the perfectly bell-shaped curve, as seen below.

[Figure: frequency polygon of "Pre-Test Stress, n=9". Baseline: Stress Level (1 to 5). Ordinate: f (0 to 3.5).]

When the curve of this frequency polygon is smoothed out with the black line (called a trendline), it produces the famous, and extremely rare, perfectly bell-shaped curve. (I know that it looks odd to have numbers with a .5 on the frequency axis but I cannot get the .5’s to disappear with the older version of Excel that I am using. Sorry. Or am I really sorry?) Again, the reason why the curve is perfectly bell-shaped here is because the MODE, MEDIAN and MEAN are all exactly the same. The peak, or highest point of the curve, always indicates the mode, which is stress level 3 in this example. However, at the peak of the curve we also find the median and mean of stress level 3. Stress level 3 perfectly divides the curve in half, or gives a perfect symmetry to the curve, because the left half of the curve from stress level 3 going leftward looks exactly like the right half of the curve from stress level 3 going rightward.

2) Symmetrical Distribution producing the Symmetrical Curve

Another type of uncommon distribution, though not as rare as the perfectly symmetrical one, is the symmetrical distribution, defined as a distribution whose median and mean are the same but whose mode is different. Like a perfectly symmetrical distribution, though, the symmetrical distribution is not skewed because the mean and median are still equal. Consider the same example asking college students about their pre-test stress, with now the following distribution.

Pre-Test Stress, n=10

Stress Level     f
1                2
2                2
3                2
4                2
5                2
__________________
Σ               10

The measures of central tendency are not all exactly the same here.

Mode = there is no mode (since no stress level has a greater frequency than the others)
Median = stress level of 3 (looking for the middle of 10 people, 10/2 = 5; since 10 is an even number there will be two middle people, hence we need the 5th and 6th persons, both of whom are in the stress level 3 category)
Mean = stress level of 3 (Σ X*f = 30, 30/10 = 3)

Since the median and the mean are the same, there is no skew in a symmetrical distribution. The same number of people (hence the term symmetrical) will be found on either side of the mean because the mean is also playing the role of the median. When displayed in a row and ranked by stress level, these ten students would look like this.

1   1   2   2   3   3   4   4   5   5

As can be seen, the two students with a mean stress level of 3 create symmetry because they have the same number of people (or four) on either side of them. This occurs because the mean is also the median and hence, like the median (which always divides a sample in half), the mean in this example is also dividing the sample in half. Again, this is not seen very often in reality. Graphing this distribution with a frequency polygon would produce the following curve.

[Figure: frequency polygon of "Pre-Test Stress, n=10". Baseline: Stress Level (1 to 5). Ordinate: f (0 to 2.5). Annotation on the graph: "This perfectly horizontal line is still called a curve."]

The curve is perfectly horizontal, with no peak whatsoever, because there is no mode (there are 2 people in every stress category). But this is still a symmetrical curve because there are the same number of people on either side of the mean (stress level 3). This is because the mean is the same as the median and splits the sample in half. Here is another example of a symmetrical distribution. The median and mean are the same but the mode is different.

Pre-Test Stress, n=10

Stress Level     f
1                1
2                3
3                2
4                3
5                1
__________________
Σ               10

Modes = stress levels 2 and 4
Median = stress level of 3 (both 5th and 6th persons are in stress level 3 category)
Mean = stress level of 3 (Σ X*f = 30, 30/10 = 3)

When displayed in a row and ranked by stress level, these ten students would look like this.

1   2   2   2   3   3   4   4   4   5

Here, too, the mean stress level creates symmetry because it has the same number of people (four) on either side of it. This is because the mean is also the median and splits the sample in half. Graphing this distribution (see next page) would produce the following curve.

[Figure: frequency polygon of "Pre-Test Stress, n=10". Baseline: Stress Level (1 to 5). Ordinate: f (0 to 3.5).]

The curve does not look perfectly bell-shaped because it has two peaks, created by the two modes, and it also dips in the center. However, if we were to imagine a vertical line running down the graph at the mean stress level of 3, one half of the curve would still look like the other half, or have symmetry on either side of the mean line. Therefore, like a perfectly symmetrical distribution, a symmetrical distribution does not have a skew.
_________

There are two other types of distributions and curves that, on the other hand, do have skews. A skew automatically indicates that we would not find the same number of people on either side of the mean. Skewed distributions, or skewed curves, are very common in reality. They may be very skewed or minimally skewed, but in real life it is extremely common to see distributions whose mean is not the same as the median. Pay close attention to what comes next.
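Before turning to the skewed cases, the three non-skewed examples above can be re-checked with a short Python sketch. It rests on two simplifications that are mine rather than the course's: each frequency table is stored as a dictionary, and the "no mode" rule is applied whenever every category has the same frequency. The function name summarize is hypothetical.

```python
from statistics import mean, median, multimode

def summarize(table):
    """Return (modes, median, mean) for a {value: frequency} table."""
    ranked = [x for x, f in sorted(table.items()) for _ in range(f)]
    # Following the notes: if every category has the same frequency, there is no mode
    if len(set(table.values())) == 1:
        modes = []
    else:
        modes = multimode(ranked)
    return modes, median(ranked), mean(ranked)

examples = {
    "perfectly symmetrical (n=9)": {1: 1, 2: 2, 3: 3, 4: 2, 5: 1},
    "symmetrical, no mode (n=10)": {1: 2, 2: 2, 3: 2, 4: 2, 5: 2},
    "symmetrical, two modes (n=10)": {1: 1, 2: 3, 3: 2, 4: 3, 5: 1},
}

for name, table in examples.items():
    modes, med, avg = summarize(table)
    no_skew = med == avg                        # mean equals median
    freqs = [f for _, f in sorted(table.items())]
    mirrored = freqs == freqs[::-1]             # left half mirrors the right half
    print(name, modes, med, avg, no_skew, mirrored)
```

In all three cases the mean equals the median, so none of them is skewed; only the first also has a single mode at that same value.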

3) Positively Skewed Distribution producing the Positively Skewed Curve

The examples that follow are taken from real data collected from our class (which you may or may not be missing terribly at this point). The question of how stressed students are before a test was asked on the anonymous survey filled out at the start of term. The answers have been divided according to gender. The results do not surprise me at all because every semester the males and females in my classes produce different curves when the variable of pre-test stress is analyzed. The male results will be examined first.

Pre-Test Stress for Males, n=12

Stress Level     f
1                0
2                7
3                2
4                2
5                1
__________________
Σ               12

Practice calculating all the measures of central tendency before checking the answers on the next page. Correctly round the mean to one decimal place.

Pre-Test Stress for Males, n=12

Stress Level     f     X*f
1                0       0
2                7      14
3                2       6
4                2       8
5                1       5
__________________________
Σ               12      33

Mode = stress level 2
Median = stress level 2 (there are 12 males, so we need both the 6th and 7th persons since 12 is an even number; both the 6th and 7th persons are in the stress level 2 category)
Mean = stress level of 2.8 (Σ X*f = 33, 33/12 = 2.75, rounds to 2.8 at one decimal place)

Because the mean is not the same as the median, this distribution is skewed. We can also say that the mean is skewed, although not by very much since there is only a 0.8 difference separating the mean (2.8) and the median (2).

Note the following very carefully: When the MEAN is LARGER than the MEDIAN, as is the case here, then the distribution is positively skewed, or we can say that the distribution has a positive skew. We can also say that the mean is positively skewed or has a positive skew.
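The rule just stated can be written as a small comparison. The sketch below is illustrative only: the function name skew_type is hypothetical, and the "negative" branch simply mirrors the positive rule (mean smaller than median), which is the standard convention but is an assumption here because the notes have not spelled it out up to this point.

```python
def skew_type(mean_value, median_value):
    """Classify a distribution by comparing its mean with its median."""
    if mean_value > median_value:
        return "positive skew"
    if mean_value < median_value:
        # Assumption: the mirror image of the positive rule
        return "negative skew"
    return "no skew"

# Male pre-test stress: mean = 33 / 12 = 2.75, median = 2
print(skew_type(33 / 12, 2))

# Annual incomes: mean = 530,000 / 6, median = 35,000
print(skew_type(530_000 / 6, 35_000))

# Perfectly symmetrical stress example: mean = median = 3
print(skew_type(27 / 9, 3))
```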

***When the skew of a distribution is positive, the following will always occur:

More than 50% of the sample will be below the mean or have less than the mathematical average. Test this out on the sample of 12 males by carefully examining what comes next.

Is it true that more than 50% of these 12 males have a stress level that is less than the mean stress level of 2.8? The mean of 2.8 occurs somewhere between stress levels 2 and 3. Situate the mean of 2.8 on the table by imagining a line between levels 2 and 3 as illustrated next.

Pre-Test Stress for Males, n=12

Stress Level     f
1                0
2                7
------------------  mean = 2.8
3                2
4                2
5                1
__________________
Σ               12

Then ask yourself: on which side of this imaginary mean line of 2.8 are most males? The table shows that 7 of the 12 males (or 58.3%) have a stress level that is less than 2.8; in other words, their stress level is below, or less than, the mathematical average.

***The positive skew of male pre-test stress levels allows us to conclude that most males are not very stressed before a test because most (i.e., more than 50%, or specifically 58.3%, since 7 / 12 = 0.583) have a stress level that is less than the mean level of 2.8. To repeat (because I live to repeat myself), in a positively skewed distribution, more than 50% of the sample will always be below average (in the sense of having less than...
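The share of the 12 males who fall below the mean can be recomputed directly from the frequency table. This is a sketch with illustrative names; it only counts how many people have a stress level lower than the mean.

```python
# Frequency table for the 12 males: stress level -> f
males = {1: 0, 2: 7, 3: 2, 4: 2, 5: 1}

n = sum(males.values())                                   # total of the f column
mean_level = sum(x * f for x, f in males.items()) / n     # 33 / 12

# Count everyone whose stress level is below the mean
below = sum(f for x, f in males.items() if x < mean_level)
share_below = 100 * below / n

print(below, n, round(share_below, 1))
```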

