STAT1070 S1 2021 Assignment 1 Solutions PDF

Title STAT1070 S1 2021 Assignment 1 Solutions
Author Kate Strauss
Course Statistics for the Sciences
Institution University of Newcastle (Australia)


STAT1070 Assignment 1 Solutions
Semester 1, 2021
Due: Electronically via Blackboard by 11:59pm, Sunday 4 April.

Please justify your answer to each question. This justification can involve hand calculation or relevant interpretation of output from jamovi and/or statstar.io. If a question requires a calculation, please show your working. If a question requires output from statstar.io, please provide and refer to this output accordingly. Do not simply copy and paste jamovi or statstar.io output; provide a concise interpretation of this output where appropriate. Statistics is about communication as much as it is about calculations, so please write your answers in sentences.

Your assignment does not need a cover sheet. There is a template on Blackboard that you can use to help structure your assignment. The majority of your assignment should be typed; however, if you need to hand draw anything, you can include it in your document by scanning or taking a photo of your working. Submit your assignment as a PDF to ensure that it uploads successfully and can be read. Make sure you receive a receipt for submission and record or save the information in it. It is your responsibility to ensure that your assignment has been uploaded properly. If you experience problems submitting your assignment, email [email protected] a copy of the completed assignment before the due date and outline the problems you are having with the submission.

The mark for an assignment submitted after the due date and time, without an approved extension of time, will be reduced by 10% of the possible maximum mark for the assessment item for each day or part day that the submission is late. Note: this applies equally to week and weekend days. Given the need to make the solutions available to students, assignments will not be accepted more than 7 calendar days after the due date.


Question 1. The jamovi file unistudents.omv contains data on 51 randomly selected university students. The variables in the data set are:

Sex: The sex of the student. There are two students who did not disclose their sex (ND).
Height: The measured height in centimetres.
University Year: The year of their degree program the student is studying.
Area Study: The area in which their degree program is. P - Physical Sciences; H - Humanities; B - Biological Sciences; S - Social Sciences.
Hours Study: The number of hours per week the students devote to their studies.

(a) [6 marks] What type of variable are: Height; University Year; and Area Study? Provide a justification for your choices.
(b) [4 marks] Provide the appropriate plot for the distribution of heights. Describe the distribution.
(c) [3 marks] Provide the appropriate plot for the distribution of areas of study. Describe the distribution.
(d) [5 marks] Describe the relationship between height and hours of study, ensuring you provide appropriate output from jamovi. Justify your statement about the strength of the relationship. Would you expect there to be a relationship between these two variables? Justify your response.
(e) [4 marks] Provide the appropriate plot for determining if there is a relationship between university year and the number of hours students study. Describe the relationship.

Solution:

(a) Height is a continuous variable: within any given interval there are infinitely many possible responses.

Area Study is a nominal variable, as the responses are categories that have no natural order.

University Year could be ordinal, discrete, or continuous, depending on how it is interpreted.

[Ordinal] If we think of the year not as the counted number of years a student has been at uni but as the stage they have reached in their degree (i.e. first year means they are doing first-year courses regardless of how many semesters they have actually been at uni), then time has been broken down into separate categories that have a natural order.

[Discrete] If we think of the possible responses as a count of the number of years a student has been at uni, and a response such as 1.5 years doesn’t make sense, we have a discrete variable. Another way to think about it: take any two possible responses, say 2 and 4. There is a fixed number of possible responses between them, so the variable can be considered discrete.

[Continuous] If we think of the responses as representing the number of years a student has been at uni, we can say that we have a measurement of time that has been rounded. Rounding doesn’t change the nature of the variable. Therefore there is a continuum of possible responses between two values; it is just that they are always rounded down.

University Year is a very tricky one to determine, especially with the limited amount of information provided. In practice, one would seek out more information from those who collected the data and look at the question posed to the students or the methodology used to allocate responses. What is more important is the justification for your choice.

[2 marks each] All marks are allocated for the justifications.

Figure 1: Boxplot and descriptive statistics table for height.

(b) From Figure 1, the distribution of heights is roughly symmetric and bimodal. The mean of the distribution is 174.9cm with a standard deviation of 8.38cm. The two modes appear around 170cm and 182cm, and are likely explained by the sex of the student.

[1 mark] appropriate boxplot and/or histogram
[1 mark] mention shape - bimodal
[1 mark] mention centre - mean, or mean and median, depending on what they say about symmetry
[1 mark] mention spread - consistent with location measures given
Note: It is not necessary to mention the location of the modes, but this should be done to properly describe the distribution.
Note: If units are not used, only penalise once in this whole question.

Figure 2: Pareto chart for the distribution of Area of Study.

(c) The plot in Figure 2 shows that the most common areas of study are social sciences and biological sciences. These two areas account for roughly 75% of the students. The area with the lowest count of students is humanities.

[1 mark] Pareto chart
[2 marks] A description of the plot with some mention that the two areas S and B take up about 75% of students (the majority)

Figure 3: Scatterplot and correlation for hours study and height.

(d) The scatterplot in Figure 3 shows a weak negative relationship. Because the relationship is so weak, it is hard to determine whether it is linear or not. The sample correlation of -0.228 also suggests that the relationship is negative and weak.

We are looking to see if there is a relationship between height and the number of hours someone devotes to study.

[expecting no relationship] Without knowing of any genetic link between height and one’s commitment to study, I would not expect there to be a relationship.

[expecting a relationship] As students progress through university they get older, which means they may also be getting taller. Students at later stages of their degree may therefore be both taller and studying more, despite the scatterplot suggesting the opposite.

[1 mark] scatterplot with height on the x-axis and hours on the y-axis
[2 marks] weak, negative, and some comment about linearity/non-linearity
[1 mark] correlation coefficient of -0.228
[1 mark] comment about expectation of there being a relationship


Note: Two example answers are provided for the expectation. Students are only expected to provide one.

(e) If University Year is considered ordinal, we would expect side-by-side boxplots. Even if it is considered discrete, we would still likely use side-by-side boxplots. These are shown in Figure 4. That said, a scatterplot is also shown in Figure 5, and it presents similar information to the side-by-side boxplots.

Looking at the side-by-side boxplots, we can see that the boxplots are roughly symmetric, with some years showing skewness. There are points that are identified as potential outliers. Therefore we will focus on the median and IQR as the measures of centre and spread.

Starting with a comparison of locations, the median hours of study appears to increase as university year increases, with medians of 25.7, 27.4, 32.25 and 44.40 hours for years 1 to 4 respectively. The means for those years are 25.34, 28.90, 33.40, and 39.68 hours.

As with the locations, the spread appears to increase overall across the years, with IQRs of 4.60, 9.70, 7.325, and 14.20 hours for years 1 to 4 respectively (the IQR dips between years 2 and 3). The standard deviations for those years are 4.59, 8.03, 8.23, and 11.20 hours.

[1 mark] side-by-side boxplots
[1 mark] shape comparison: mention of possible skewness and outliers
[1 mark] location comparison
[1 mark] spread comparison
Note: If a scatterplot is presented, the answer should talk about linearity, direction, and strength.


Figure 4: Boxplots and descriptive statistics for hours of study broken down by university year.


Figure 5: Scatter plot of hours of study against university year with correlation output.

Question 2. To get full marks for the following questions you need to convert the question from words to a mathematical expression (i.e. use mathematical notation), defining your events where necessary, and using correct probability statements.

A polling company performed research on how well Australian state premiers were performing in light of the COVID-19 pandemic. Respondents from each state of Australia were asked for their level of approval for their particular state premier, with possible options of “approve”, “don’t know”, and “disapprove”. In addition, their gender was recorded. Among men, 40 approved, 45 disapproved and 10 said don’t know. Among women, 62 approved, 30 disapproved and 5 said don’t know.

For the following questions, use A1 to define the event that someone approves of their state premier, and A2 and A3 to define the events disapprove and don’t know respectively. Let M represent the event that the respondent was male. Use the above sample counts to determine the probabilities asked for in the following questions.

(a) [2 marks] What type of variable is the level of approval for one’s state premier? Carefully explain your reasoning.
(b) [2 marks] Construct a contingency table summarising the data collected.
(c) [2 marks] If we were to choose a male at random, what is the probability that they will approve of the state Premier’s performance?
(d) [2 marks] What is the probability that a randomly selected Australian is a woman who approves of the state Premier’s performance?
(e) [2 marks] What is the probability that a randomly selected Australian approves of their Premier’s performance?

         approve  disapprove  don’t know  Total
Men           40          45          10     95
Women         62          30           5     97
Total        102          75          15    192

Table 1: Contingency table for polling data

(f) [2 marks] A member of the NSW state government was made aware of the research and emphatically reported to the NSW state premier that the majority of voters approve of their performance. Comment on whether you agree with this party member’s assessment of the data.

Solution:

(a) The variable is categorical, as the responses are broken down into categories. To decide whether it is ordinal or nominal, we need to think about whether there is a natural order. If we think of “don’t know” as a neutral response, we might consider this variable ordinal. If “don’t know” is a response that people give when they have no idea about the topic (i.e. they don’t even know who the premier is), then this response might not allow for a natural order, in which case the variable would be nominal.

[2 marks] can have either ordinal or nominal but must provide a justification
Note: no marks if no justification

(b) Table 1 summarises the polling data.

[1 mark] completing the joint count cells in the centre of the table
[1 mark] completing the marginal totals

(c) P(A1 | M) = 40/95 = 0.421
Therefore if we were to choose a male at random, the probability that they will approve of the state Premier’s performance is 42.1%.

[1 mark] correct probability with correct notation
[1 mark] sentence summarising their calculation

(d) P(A1 ∩ M^c) = 62/192 = 0.323
Therefore the probability that a randomly selected Australian is a woman who approves of the state Premier’s performance is 32.3%.

[1 mark] correct probability with correct notation
[1 mark] sentence summarising their calculation

(e) This asks for P(A1). From the information available, P(A1) = 102/192 = 0.531
Therefore 53.1% of Australians approve of their state premier.

[1 mark] correct probability with correct notation
[1 mark] sentence summarising their calculation
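The three probabilities in parts (c) to (e) can be checked with a short script. This is only a sketch in Python (the assignment itself uses jamovi and statstar.io); the dictionary layout and variable names are illustrative assumptions, while the counts are copied from Table 1.

```python
# Counts copied from Table 1 (rows: gender, columns: approval response).
# The dictionary keys are illustrative names, not part of the assignment.
counts = {
    "men": {"approve": 40, "disapprove": 45, "dont_know": 10},
    "women": {"approve": 62, "disapprove": 30, "dont_know": 5},
}

men_total = sum(counts["men"].values())
grand_total = sum(sum(row.values()) for row in counts.values())
approve_total = sum(row["approve"] for row in counts.values())

# (c) Conditional probability: restrict the sample space to men.
p_approve_given_male = counts["men"]["approve"] / men_total

# (d) Joint probability: women who approve, out of all respondents.
p_approve_and_female = counts["women"]["approve"] / grand_total

# (e) Marginal probability: all approvers, out of all respondents.
p_approve = approve_total / grand_total

print(f"P(A1 | M)     = {p_approve_given_male:.1%}")
print(f"P(A1 and M^c) = {p_approve_and_female:.1%}")
print(f"P(A1)         = {p_approve:.1%}")
```

The only difference between the three calculations is the denominator: the male row total for the conditional probability, and the grand total for the joint and marginal probabilities.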


(f) The data were collected across all of the states. Therefore the probabilities cannot be attributed directly to any one state premier. Some premiers may have approval greater than 50%, whereas others may have approval below 50%.

[1 mark] for recognising the member has made a statement about a population from which data were not specifically collected
[1 mark] a reasonable argument that is justified


Question 3. To get full marks for the following questions you need to convert the question from words to a mathematical expression (i.e. use mathematical notation), defining your random variables where necessary, and using correct probability statements.

A German physician, Carl Wunderlich, measured temperatures from about 25,000 people in the mid 1800s and found that the average was 37 degrees Celsius. There is some evidence that the mean body temperature has been changing over time. Assume that the body temperature of adults is approximately normally distributed with a mean of 37 degrees Celsius and a standard deviation of 0.40 degrees Celsius.

(a) [2 marks] What value of body temperature distinguishes the highest 10%?
(b) [2 marks] People with body temperatures above 38 degrees are said to have a significant fever. Show that approximately 0.62% of people have a significant fever.
(c) [4 marks] Suppose we take a random sample of 50 people and want to determine whether or not they have a significant fever. Let X represent the number of people in a sample of size 50 who have a significant fever. What probability distribution does X follow? Justify your answer.
(d) [2 marks] Using the probability distribution from part (c), find the probability that at least 1 person in the random sample of 50 has a significant fever.

Solution: Let T be the body temperature of a randomly selected adult. From the information given, T ∼ N(µ = 37, σ² = 0.40²).

(a) We are interested in the value t such that P(T > t) = 0.10. Figure 6 shows the calculation in StatStar for P(T < 37.513) = 0.90. Therefore a temperature of 37.513 degrees Celsius defines the top 10% of body temperatures.

Figure 6: Output from statstar.io for Question 3a

[1 mark] for P(T ≥ c) = 0.1 or equivalent statement
[1 mark] for correct answer with some evidence of quantile (working and/or output)
Note: A student can get full marks without standardising by entering a mean of 37 and standard deviation of 0.40 into statstar.

(b) Here we are interested in P(T > 38). Figure 7 shows the calculation in StatStar for P(T < 38) = 0.9938. Therefore P(T > 38) = 1 − 0.9938 = 0.0062.
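The quantile in part (a) and the tail probability in part (b) can be reproduced outside statstar.io. The following is a minimal sketch using Python's standard-library `statistics.NormalDist` (Python 3.8 or later); the variable names are illustrative assumptions, and the mean and standard deviation are those given in the question.

```python
from statistics import NormalDist

# Body temperature model from the question: mean 37, standard deviation 0.40.
temperature = NormalDist(mu=37, sigma=0.40)

# (a) The cut-off for the highest 10% is the 90th percentile,
#     i.e. the value t with P(T < t) = 0.90.
top_10_cutoff = temperature.inv_cdf(0.90)

# (b) P(T > 38) is the complement of the cumulative probability at 38.
p_fever = 1 - temperature.cdf(38)

print(f"Top 10% cut-off: {top_10_cutoff:.3f} degrees Celsius")  # 37.513
print(f"P(T > 38) = {p_fever:.4f}")                             # 0.0062
```

As in the marking note, no standardising is needed here because the mean and standard deviation are passed to the distribution directly.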

Figure 7: Output from statstar.io for Question 3b

[1 mark] for P(T > 38)
[1 mark] for correct answer with some evidence of probability (working and/or output)
Note: A student can get full marks without standardising by entering a mean of 37 and standard deviation of 0.40 into statstar.

(c) The distribution of X is binomial with n = 50 and p = 0.0062 because it satisfies the binomial conditions:

- There is a fixed number of trials (n = 50).
- Each trial can be labelled as a “success” (significant fever) or “failure” (no significant fever).
- Each trial is independent (ensured by random sampling).
- The probability of success is constant for each trial (p = 0.0062 as found in part (b)).

[1 mark] for identifying that X follows a binomial distribution
[1 mark] for n = 50 and p = 0.0062
[1 mark] for noting at least 2 of the conditions of the binomial setting
[1 mark] for noting all 4 conditions of the binomial setting

(d)
$$
\begin{aligned}
P(X \ge 1) &= 1 - P(X \le 0) \\
&= 1 - P(X = 0) \\
&= 1 - \binom{50}{0}(0.0062)^{0}(1 - 0.0062)^{50} \\
&= 1 - 0.7327 \\
&= 0.2673
\end{aligned}
$$

Hence, there is a 26.73% chance of at least 1 person in the random sample of 50 having a significant fever.

Figure 8: Output from statstar.io for Question 3d

Alternatively, you can find this probability with statstar.io, as shown in Figure 8.

[1 mark] for P(X ≥ 1) or equivalent statement
[1 mark] for the correct probability with justification (working or output)
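The hand calculation in part (d) can also be checked in a few lines. This is a sketch in Python using only the standard library; the helper `binomial_pmf` is a hypothetical name introduced for illustration, and p = 0.0062 is the rounded value from part (b).

```python
from math import comb

n = 50        # sample size from the question
p = 0.0062    # P(significant fever), rounded as in part (b)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Complement rule: "at least one" is everything except "none".
p_none = binomial_pmf(0, n, p)
p_at_least_one = 1 - p_none

print(f"P(X = 0)  = {p_none:.4f}")          # 0.7327
print(f"P(X >= 1) = {p_at_least_one:.4f}")  # 0.2673
```

Using the complement avoids summing the 50 terms P(X = 1) through P(X = 50) individually.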
