Lab8 F19 Final - Lab 8 PDF

Title	Lab8 F19 Final - Lab 8
Course	Introductory Statistics
Institution	Cornell University

STSCI 2150 Lab 8: Fitting Probability Models to Frequency Data

Due on Friday October 25, 2019 at 11:59 PM

Your solutions to the problems below should be put into a lab report. You should work on this lab during your lab section, so that you can get help if needed from the teaching assistants. You may complete the lab later. You will have about one week to complete the lab report. Include in your lab report the R code, R output, and the plots used to answer the problems. This will allow the grader to help you find errors, if any, and to give partial credit. Please submit your homework electronically, in PDF format.

Goals
• Better understand the logic of hypothesis testing
• Compare frequency data to a proportional distribution or a Poisson distribution
• Learn how to calculate a chi-squared statistic for a Goodness-of-Fit test

1 M&M Activity

This activity will be your first introduction to the Goodness-of-Fit test. Using a bag of M&M’s, you will count the number of M&M’s of each color. You will then compare these counts to the expected counts implied by the color percentages published by the M&M/Mars company. We will use R to perform a Goodness-of-Fit test to determine whether the difference is statistically significant.

To begin, divide yourselves into groups of two to three people. Each group will be given a container of M&M’s.

1. Open your container of M&M’s. Count and record the number of each color, and the total number of M&M’s (for example 10 orange, 5 yellow, 6 red, 6 blue, 4 brown, 5 green, and 36 M&M’s total).
2. Your TA will compile all the tallies for the lab section. You should do your analysis on the data collected for your entire lab section.
3. Enter the aggregated color and total counts into the observed row in the table in Problem 1.
4. Using the data from the entire lab section, compute the expected number of each color based on the percentages below. Enter the expected counts into the expected row in the table in Problem 1.

Color                 Blue   Orange   Green   Yellow   Red     Brown
Expected Percentage   25%    25%      12.5%   12.5%    12.5%   12.5%

For example, if there were 600 total candies in the bag, the expected number of blue candies would be 600 · 0.25 = 150.

5. Using both the observed counts for each color and the expected counts from steps (3) and (4), complete Problem 1.
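The expected-count calculation in step 4 can be sketched in R as follows. This is only a minimal illustration: it assumes the example tallies from step 1 (36 candies in total) rather than real lab-section data, and the variable names are hypothetical.

```r
# Illustrative tallies from step 1 (not real lab-section data), ordered to
# match the percentage table: blue, orange, green, yellow, red, brown.
colors   <- c("Blue", "Orange", "Green", "Yellow", "Red", "Brown")
observed <- c(6, 10, 5, 5, 6, 4)

# Expected percentages from the table, written as proportions.
props <- c(0.25, 0.25, 0.125, 0.125, 0.125, 0.125)

# Expected count for each color = total number of candies * proportion.
total    <- sum(observed)
expected <- total * props

# Side-by-side view of observed and expected counts.
data.frame(color = colors, observed = observed, expected = expected)
```

Replacing observed with the aggregated counts for the lab section gives the expected row needed for Problem 1.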

2 The Proportional Model and the Goodness-of-Fit Test

The proportional model is a simple model in which the frequency of occurrence of events is proportional to the number of opportunities. An example would be the frequency of births on each of the seven days of the week. The proportion for any particular day should be close to 1/7 under an equal probability assumption.

We will return to the example from the previous lab: the numbers between 1 and 10 that students chose at random on the first class poll in a previous semester. Under equal probability, we would expect each number to be chosen by about 1/10 of the respondents. That is, each number would be chosen by about the same number of students. Under this model, we expect each number to be chosen about 14.3 times, or (1/10)*(143), where 143 is the total number of respondents. We wish to test the null hypothesis that the random numbers chosen follow a uniform distribution. The null and alternative hypotheses are:

H0: The proportion of occurrences of each number choice equals 1/10. Equivalently, we could say, “The data have a uniform distribution.” ($p_1 = p_2 = \cdots = p_{10} = \frac{1}{10}$, where $p_i$ is the probability of choosing the i-th integer.)

HA: At least one category has a proportion that is not equal to 1/10. “The data do not have a uniform distribution.” (At least one $p_i$ is not equal to $\frac{1}{10}$.)

The results are shown in the following table and in Figure 1.

Chosen Number   1      2      3      4      5      6      7      8      9      10
Observed        6      12     18     13     14     17     26     24     10     3
Expected        14.3   14.3   14.3   14.3   14.3   14.3   14.3   14.3   14.3   14.3

The observed values for the number of times each random number was chosen are stored in the vector randomnum. There appears to be an excess of 3’s, 6’s, 7’s and 8’s and a deficit of all other numbers. The expected values are the same for each digit, 1 through 10, since under the null hypothesis the distribution is uniform. Thus, the expected values are 14.3 for each digit; we create a vector of expected values by repeating the value 14.3 ten times.

Figure 1: Students’ Randomly Chosen Integers

We calculate the chi-squared test statistic and find the associated P-value using the code below.

    options(digits = 4)
    # observed counts for the integers 1 through 10
    randomnum = c(6, 12, 18, 13, 14, 17, 26, 24, 10, 3)
    observed = randomnum
    # expected count of 14.3 for each of the ten integers
    expected = rep(14.3, 10)
    test.stat = sum((observed - expected)^2 / expected)
    # degrees of freedom: number of categories minus 1
    df = 9
    pvalue = 1 - pchisq(test.stat, df)

The test statistic, test.stat, is 33.15. With df = 9, the resulting P-value, pvalue, is much less than 0.01, which means that we reject the null hypothesis. We conclude that the class in that semester did not pick random numbers uniformly.

A somewhat more elegant set of R commands to calculate the chi-squared test statistic would look like this:

    # Alternative code to calculate chi-squared statistic
    students = sum(randomnum)
    expected.num = (1/10)*students
    expected = rep(expected.num, 10)
    test.stat = sum((observed - expected)^2 / expected)
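As a cross-check, and not as part of the original lab code, the same test can be run with R's built-in chisq.test function. The sketch below assumes the observed counts from the table of chosen numbers; the name fit is just an illustrative choice.

```r
# Observed counts for the integers 1 through 10.
randomnum <- c(6, 12, 18, 13, 14, 17, 26, 24, 10, 3)

# Goodness-of-fit test against equal probabilities of 1/10 for each integer.
fit <- chisq.test(randomnum, p = rep(1/10, 10))

fit$expected    # expected count for each integer under the uniform model
fit$statistic   # chi-squared test statistic
fit$parameter   # degrees of freedom (number of categories - 1)
fit$p.value     # P-value
```

Computing the statistic by hand is still worthwhile, because chisq.test always uses (number of categories) - 1 degrees of freedom and does not adjust for parameters estimated from the data.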


ToDo Task Question 1: Do a hypothesis test of whether students are equally likely to choose the integers 1 to 10 using this semester’s data, located in “First Class Poll F19.csv” in the Lab 8 Module.

1. Write the null and alternative hypotheses. Use both a sentence and statistical notation.
2. State the test statistic and its value. Calculate the p-value and conclude whether or not to reject the null hypothesis.
3. Was the distribution significantly different from the uniform distribution?

3 Comparison to a Poisson Distribution

The records of in-hospital heart attacks at a particular hospital were compiled weekly over a period of 261 weeks. We are interested in knowing whether in-hospital heart attacks occur “randomly” in time. The best way to answer this question is to test whether or not the data have a Poisson distribution. The results are shown in the following table and in Figure 2. At a quick glance, the data do appear to have a Poisson distribution.

No. of In-Hospital Cardiac Events   0    1    2    3    4
Observed                            89   90   59   17   6

The table shows, for example, that there were six weeks during this time period in which there were exactly four in-hospital cardiac events. There are a total of 261 weeks in the data set. Our research question is: Does the distribution of in-hospital cardiac events follow a Poisson distribution? The null and alternative hypotheses are:

H0: The distribution of weekly in-hospital cardiac events is Poisson.

HA: The distribution of weekly in-hospital cardiac events is not Poisson.

The data are provided in the file “cardiac events.csv” and are in a weekly summary format. The variable events is the number of in-hospital heart attacks during a week. The variable freq.in is the number of weeks that had that number of events. Figure 2 shows the data and an overlaid Poisson distribution.

Figure 2: Weekly Counts of In-Hospital Cardiac Events

We approach this problem in much the same way as we did in the previous sections. We wish to compare our observed distribution of in-hospital cardiac events to a Poisson distribution. First, we build

our Poisson distribution. To do this, we find the mean of our data, denoted $\hat{\mu}$. The estimate, $\hat{\mu}$, is just the average number of cardiac events per week over the entire 261 weeks. We calculate:

$$\hat{\mu} = \frac{89 \cdot 0 + 90 \cdot 1 + 59 \cdot 2 + 17 \cdot 3 + 6 \cdot 4}{261} = \frac{283}{261} \approx 1.084.$$

The next step is to calculate the Poisson probabilities for each of the possible values 0, 1, 2, 3, 4 using the Poisson probability formula,

$$\Pr(X = x) = \frac{e^{-\hat{\mu}} \, \hat{\mu}^{x}}{x!}.$$

We multiply the probabilities by the total number of weeks (261) to get the expected frequency for each number of in-hospital cardiac events. These are our expected counts.

No. of In-Hospital Cardiac Events   0       1       2       3       4
Observed                            89      90      59      17      6
Expected                            88.25   95.69   51.88   18.75   5.08
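The expected counts in the table can be reproduced with a few lines of R. This is a minimal sketch that types the observed counts in directly instead of reading them from a file; the vector names are illustrative.

```r
# Number of events per week and the number of weeks with that many events.
events  <- 0:4
freq.in <- c(89, 90, 59, 17, 6)

# Estimated mean: total number of events divided by total number of weeks.
mu_hat <- sum(events * freq.in) / sum(freq.in)

# Poisson probabilities for 0 through 4 events, scaled by the number of weeks.
expected <- dpois(events, lambda = mu_hat) * sum(freq.in)

round(expected, 2)
```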

Once we have this vector of expected values, we find Pearson’s chi-squared statistic, which has the general form

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i},$$

where $E_i$ and $O_i$ are the expected and observed values for the i-th group, for i = 1, . . . , k groups. We calculate the test statistic and find the associated P-value. First we have to read in the data and define the variables, as shown in the R code below:

    options(digits = 4)
    # read in "cardiac events.csv" data and save variables
    cardiac = read.csv("cardiac events.csv", header = TRUE)
    names(cardiac)
    head(cardiac)
    events = cardiac$events
    freq.in = cardiac$freq.in

The probability mass function for the Poisson distribution is as follows: for a random variable $X \sim \text{Poisson}(\mu)$,

$$P(X = x) = \frac{e^{-\mu} \mu^{x}}{x!}.$$

In the code below, we calculate the expected values under the Poisson assumption and calculate the chi-squared statistic and P-value. First, we estimate the mean number of events per week in our sample by multiplying each event count by its frequency, summing the products, and dividing by the total number of weeks. Next, we calculate the expected values under the Poisson distribution using the estimated mu.

    # estimate mu
    mu_hat = sum(events*freq.in) / sum(freq.in)
    mu_hat
    # find expected values under Poisson distribution
    dens = dpois(events, lambda = mu_hat)
    expected = dens*sum(freq.in)
    # calculate chi-squared and associated P-value
    observed = freq.in
    chi.sq = sum(((observed - expected)^2) / expected)
    df = 3
    pvalue = 1 - pchisq(chi.sq, df)

Note that we calculate the degrees of freedom as df = (Number of categories) - 1 - (Number of parameters estimated from the data) = 5 − 1 − 1 = 3.

If we calculate the chi-squared statistic as shown above for the five categories in the table, we get about 1.65, with a P-value of about 0.65. Since the P-value is greater than 0.05, we do not reject the null hypothesis. Our conclusion is that there is insufficient evidence to reject our hypothesis that the weekly counts of in-hospital cardiac events have a Poisson distribution. Since the cardiac events appear to have a Poisson distribution, the evidence indicates that cardiac events are happening randomly (or evenly) over time.

You should not construe these results to mean that we “accept” the null hypothesis; instead, we are not rejecting it. This is an important distinction. It’s not possible to definitively state that the distribution is Poisson, since we can never be absolutely certain. But the data are consistent with this hypothesis.
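The degrees-of-freedom rule can be wrapped in a small helper. The function gof_pvalue below is a hypothetical illustration, not part of the lab's code; it assumes the observed and expected counts are already available as vectors and takes the number of estimated parameters as an argument.

```r
# Hypothetical helper (not from the lab handout): chi-squared goodness-of-fit
# test where n_estimated parameters were estimated from the data.
gof_pvalue <- function(observed, expected, n_estimated = 0) {
  chi.sq <- sum((observed - expected)^2 / expected)
  # df = (number of categories) - 1 - (number of parameters estimated)
  df <- length(observed) - 1 - n_estimated
  pvalue <- pchisq(chi.sq, df, lower.tail = FALSE)
  list(chi.sq = chi.sq, df = df, pvalue = pvalue)
}

# Cardiac events example: observed counts and the (rounded) Poisson expected
# counts from the table; one parameter (mu) was estimated from the data.
observed <- c(89, 90, 59, 17, 6)
expected <- c(88.25, 95.69, 51.88, 18.75, 5.08)
result <- gof_pvalue(observed, expected, n_estimated = 1)
result$df
```

Calling pchisq with lower.tail = FALSE is equivalent to computing 1 - pchisq(chi.sq, df). For a fully specified model such as the uniform distribution, n_estimated stays at its default of 0.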

4 Summary of New R Commands

R Function             Description
dpois(count, lambda)   Density value for the Poisson distribution with mean lambda (µ)
pchisq(q, df)          Cumulative distribution function for the chi-squared distribution with df degrees of freedom
sum(x)                 Computes the sum of all the values present in its argument x

Problems

Problem 1 (M&M Color Proportions) In order to complete this problem, data will be aggregated from each group for each class. Use the aggregated counts from your lab section to complete this problem.

a. Complete the following table:

Statistic      Blue   Orange   Green   Yellow   Red     Brown   Total
Percentages    25%    25%      12.5%   12.5%    12.5%   12.5%   100%
Observed (O)
Expected (E)
(O−E)²/E

b. Perform the appropriate test to determine whether the class’s candy color proportions were the same as those provided by the Mars company website.

c. Make a 95% Agresti-Coull confidence interval for the true proportion of red M&M’s. Does your interval contain the true proportion of reds (12.5%) in the population?

d. M&M’s are actually made in two different manufacturing sites in the US, one in New Jersey and the other in Tennessee. The expected percentages that were provided above were for the New Jersey factory. The percentages of colors at the Tennessee factory were: blue 20.7%, orange 20.5%, green 19.8%, yellow 13.5%, red 13.1%, brown 12.4%. Repeat this statistical analysis using these new proportions. Does it appear that our M&M’s come from the New Jersey factory or the Tennessee factory? Or does it appear that our M&M’s don’t seem to match either factory? Justify your answer.

Problem 2 (Did our class choose numbers randomly?) Complete Task Question 1.

Problem 3 In dragons, variation in wingspan is determined by a single gene. WW individuals have long wings, Ww (heterozygous) individuals have medium wings, and ww individuals have short wings. In a cross between heterozygous (medium-winged) dragons, the expected ratio of long-winged : medium-winged : short-winged offspring is 1:2:1.

a. The results of such a cross were 8 long-winged, 18 medium-winged, and 6 short-winged dragons. Do these results differ significantly (at a 5% level) from the expected frequencies?

b. In another, larger experiment, you count 10 times as many dragons as in the experiment in part (a), and find 80 long-winged, 180 medium-winged, and 60 short-winged dragons. Do these results differ significantly from the expected 1:2:1 ratio?

c. Do the proportions observed in the two experiments [(a) and (b)] differ? Did the results of the two hypothesis tests differ? Why or why not?

Problem 4 (Flu Vaccine) In a recent class poll, 54 out of 87 students indicated that they had been vaccinated for the flu.

a. Use the Agresti-Coull method to create a confidence interval for the true proportion of Cornell students who have gotten the flu vaccine.

b. The national flu vaccination rate for Americans in the 18-29 age group is 33.6 percent. Using the results in part (a), does it appear that the rate of flu vaccination among Cornell students is significantly different from the national rate? Explain your reasoning.

Problem 5 Chapter 8, Problem 11 (Types of Distributions)

Problem 6 (Wine Snob?) When the performances of individuals, or preferences for products, are judged in sequence by subjective criteria, does the position in the sequence affect the opinion of the judges? An experiment to look for these order effects gave 33 volunteers four glasses of wine in sequence, one at a time. Participants were asked to say which of the four was the superior wine. Unknown to the participants, all four glasses were poured from the same bottle. Fifteen participants preferred the first glass, 5 preferred the second glass, 2 preferred the third glass, and 11 preferred the last glass. Is there evidence from these data that position in the sequence affected the preference of the volunteers? You should do all the steps of a hypothesis test to answer the above question. (Also, be sure to check the sample size assumptions for the test.)

Problem 7 Chapter 8, Problem 12 (Spirit Bears)

Problem 8 Chapter 8, Problem 17 (Truffles) In order to satisfy the minimum sample size assumptions, you may need to combine categories.
