Exam 2015, questions and answers - Resit PDF

Title Exam 2015, questions and answers - Resit
Course Statistics 1
Institution University of London
Pages 13


ST104a Statistics 1

Examiners’ commentaries 2015 ST104a Statistics 1

Important note

This commentary reflects the examination and assessment arrangements for this course in the academic year 2014–15. The format and structure of the examination may change in future years, and any such changes will be publicised on the virtual learning environment (VLE).

Note that in what follows • corresponds to 1 mark unless stated otherwise.

Information about the subject guide and the Essential reading references

Unless otherwise stated, all cross-references will be to the latest version of the subject guide (2014). You should always attempt to use the most recent edition of any Essential reading textbook, even if the commentary and/or online reading list and/or subject guide refer to an earlier edition. If different editions of Essential reading are listed, please check the VLE for reading supplements – if none are available, please use the contents list and index of the new edition to find the relevant section.

Comments on specific questions – 23 September replacement examination

Candidates should answer THREE of the following FOUR questions: QUESTION 1 of Section A (50 marks) and TWO questions from Section B (25 marks each). Candidates are strongly advised to divide their time accordingly.

Section A

Answer all parts of Question 1 (50 marks in total).

Question 1

(a) Classify each one of the following variables as either measurable (continuous) or categorical. If a variable is categorical, further classify it as either nominal or ordinal. Justify your answer. (Note that no marks will be awarded without a justification.)

i. Delay times of a particular train.
ii. The type of fruit juice, based on the kind of fruit, produced by a company.
iii. The difficulty level of a ski slope.
iv. The quantity of wine produced by a vineyard.

[8 marks]

Reading for this question

This question requires identifying types of variable so reading the relevant section in the subject guide (Section 4.6) is essential. Candidates should gain familiarity with the notion of a variable and be able to distinguish between discrete and continuous (measurable) data. In addition to identifying whether a variable is categorical or measurable, further distinctions between ordinal and nominal categorical variables should be made by candidates.

Approaching the question

A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities, the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.

i. The delay time can be measured, for example in minutes. Therefore, this is a measurable variable.
ii. Each type of fruit juice (for example orange, apple, mixed etc.) is a category and there is no natural ordering. Therefore, this is a categorical nominal variable.
iii. Each difficulty level is a category and there is a natural ordering. Therefore, this is a categorical ordinal variable.
iv. The quantity of wine can be measured, for example in litres. Therefore, this is a measurable variable.

(b) Consider the following sample dataset:

10, 9, 4, x, 12

You are told that the value of the sample mean is 8.

i. Calculate the value of x.
ii. Find the sample variance.

[4 marks]

Reading for this question

This question contains material mostly from the subject guide, Chapter 4 and in particular Section 4.8 (Measure of location) for part (i) and Section 4.9 (Measure of spread) for part (ii).

Approaching the question

First you need to write down the formula for the sample mean. Then, it is important to do the summation carefully and divide by the correct number of observations to obtain the mean. Note that the sum in the numerator will contain the unknown x, hence this will give you a simple equation. The solution of this equation will provide x. The working is as follows.

i. • Since the sample mean is equal to 8, we can write:

\[ \frac{10 + 9 + 4 + x + 12}{5} = 8 \]

• or else:

\[ 35 + x = 40 \quad \Rightarrow \quad x = 5. \]

ii. • Method:

\[ s^2 = \frac{(10 - 8)^2 + (9 - 8)^2 + (4 - 8)^2 + (x - 8)^2 + (12 - 8)^2}{4}. \]

• Correct value: $s^2 = 11.5$.

Some candidates divided by 5 in the formula above. In such cases only one mark was awarded for part (ii), provided that the correct value was obtained. The reason is that the formula for the sample variance provided in the subject guide only suggests dividing by n − 1, where n is the number of observations. In another error that occurred in some cases, candidates subtracted the number x = 5 rather than the sample mean which is given to be 8.
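As a quick numerical check of part (b), the following is a minimal Python sketch rather than part of the official commentary. The variable names are illustrative assumptions. It solves for x from the stated mean, then compares the n − 1 divisor with the n divisor that cost some candidates a mark.

```python
from statistics import mean, variance

# Known observations; x is the unknown fifth value (names are illustrative).
known = [10, 9, 4, 12]
n = len(known) + 1          # five observations in total
target_mean = 8

# The sample mean equation (sum(known) + x) / n = 8 rearranges to:
x = target_mean * n - sum(known)

data = [10, 9, 4, x, 12]

# statistics.variance divides by n - 1, matching the subject guide formula.
s2 = variance(data)

# Dividing by n instead gives a different (smaller) value.
s2_divide_by_n = sum((v - mean(data)) ** 2 for v in data) / n

print(x, s2, s2_divide_by_n)
```

The gap between the last two values shows why the choice of divisor matters in a sample this small.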


(c) The times of marathon runners, participating in the London Marathon, are normally distributed with mean 3.5 hours and a standard deviation of 0.75 hours.

i. What is the proportion of runners in the London Marathon that finish in less than 3 hours?
ii. What is the proportion of runners that finish the London Marathon with times between 2.5 and 4.5 hours?

[4 marks]

Reading for this question

This section examines the ideas of the normal random variable. Read the relevant section of Chapter 6 of the subject guide and work through the examples and activities of this section. The sample examination questions in this chapter are quite relevant.

Approaching the question

The basic property of the normal random variable for this question is that if $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$. Note also that:

∗ $P(Z < a) = P(Z \le a) = \Phi(a)$
∗ $P(Z > a) = P(Z \ge a) = 1 - P(Z \le a) = 1 - P(Z < a) = 1 - \Phi(a)$
∗ $P(a < Z < b) = P(a \le Z < b) = P(a < Z \le b) = P(a \le Z \le b) = \Phi(b) - \Phi(a)$.

The above is all you need to find the requested proportions:

i. • We can write:

\[ P(X < 3) = P\left( \frac{X - 3.5}{0.75} < \frac{3 - 3.5}{0.75} \right) = P(Z < -2/3). \]

• Continuing from above, we get $P(Z < -0.67) = 1 - \Phi(0.67) = 0.2525$.

ii. • We can write:

\[ P(2.5 \le X \le 4.5) = P\left( \frac{2.5 - 3.5}{0.75} \le Z \le \frac{4.5 - 3.5}{0.75} \right) = P(-4/3 \le Z \le 4/3). \]

• Continuing from above, we get: $P(-4/3 \le Z \le 4/3) = \Phi(1.33) - \Phi(-1.33) = 0.9088 - 0.0912 = 0.8176$.

(d) A food chain claims that the mean fat content of a particular type of hamburger is less than 20%. In order to check this claim, a consumer group obtained a random sample of 36 hamburgers of this type, which were sent to an independent laboratory to measure the amount of fat in each one of them. Following the laboratory analysis, it was found that the sample average of fat content was $\bar{x} = 19.0\%$. Carry out a hypothesis test, at two appropriate significance levels, to determine whether the mean fat content in this type of hamburger is less than 20%. State your hypotheses, the test statistic and its distribution under the null hypothesis, and your conclusion in the context of the problem. It can also be assumed that the population variance of fat of each hamburger is 16.

[7 marks]

Reading for this question

This question refers to a one-sided hypothesis test examining whether the mean fat content of hamburgers is less than 20%. While the entire chapter on hypothesis testing is relevant, candidates can focus on the relevant sections for a single mean (8.12 and 8.13) and in particular 8.13. The question refers to one-sided hypothesis tests that are located in Section 8.10 of Chapter 8.


Approaching the question

It is essential to identify the type of hypothesis test required for this question. Since there is only one variable involved it will have to be a single mean test, and the test statistic can be found in the formula sheet. Make sure to substitute the relevant quantities carefully and avoid any numerical errors in the calculation. The next step is to identify the distribution of the test statistic. The population variance is assumed to be 16, i.e. known. Hence, the standard normal distribution should be used. The remaining steps involve finding the critical values from the corresponding statistical table for the relevant significance levels, deciding whether to reject $H_0$, and interpreting the results in the context of the problem. The working of the exercise is given below:

• $H_0: \mu = 20\%$ vs. $H_1: \mu < 20\%$. (No $\bar{X}$s; accept $H_0: \mu \ge 20\%$.)
• Test statistic value:

\[ \frac{\bar{x} - 20}{\sqrt{16/36}} = \frac{19 - 20}{2/3} = -1.5. \]

• The variance is known so the standard normal distribution should be used.
• For α = 0.05, the critical value is −1.645.
• Decision: do not reject $H_0$.
• Choose larger α, say α = 0.1, hence the critical value is −1.282, hence reject $H_0$.
• Weak evidence that the mean fat content in this type of hamburger is less than 20%.
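The test in part (d) can be reproduced with a short Python sketch. This is an illustration under the question's stated assumptions (known population variance of 16, n = 36), using only the standard library. The variable names are assumptions and do not come from the subject guide.

```python
from math import sqrt
from statistics import NormalDist

# Quantities given in the question (variable names are illustrative).
mu0 = 20.0        # hypothesised mean fat content, in %
xbar = 19.0       # sample mean, in %
sigma2 = 16.0     # known population variance
n = 36            # sample size

# One-sample z statistic for H0: mu = 20 vs. H1: mu < 20.
z = (xbar - mu0) / sqrt(sigma2 / n)

# Lower-tail critical values from the standard normal distribution.
std_normal = NormalDist()
critical = {alpha: std_normal.inv_cdf(alpha) for alpha in (0.05, 0.10)}

# Reject H0 when z falls below the critical value.
reject = {alpha: z < crit for alpha, crit in critical.items()}

for alpha in critical:
    print(f"alpha={alpha}: critical={critical[alpha]:.3f}, reject H0: {reject[alpha]}")
```

Rejecting at the 10% level but not at the 5% level is what the commentary describes as weak evidence.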

(e) The variable X takes the values 1, 3, 4 and 6 according to the following distribution:

x         1     3     4     6
p_X(x)    0.4   0.4   0.1   0.1

i. What is the probability that X is larger than 7?
ii. Find E(X), the expected value of X.
iii. Find the probability that $X^2 > 2$.

[5 marks]

Reading for this question

This is another question on probability, exploring the concepts of relative frequency, conditional probability and probability distribution. Reading from Chapter 5 of the subject guide is suggested with a focus on the sections on these topics. Try Activity A5.1 and the exercises on probability trees.

Approaching the question

i. • X takes values that are at most 6, so the probability is 0.

ii. •• $E(X) = \sum_i x_i P(X = x_i) = 1 \times 0.4 + 3 \times 0.4 + 4 \times 0.1 + 6 \times 0.1 = 2.6$.

iii. • The probability distribution of $Z = X^2$ will be:

z         1     9     16    36
p_Z(z)    0.4   0.4   0.1   0.1

• Hence, the correct probability is 0.6. Note that this part may be answered without deriving the probability distribution table of Z. One can note that only the value X = 1 gives $X^2 \le 2$, hence the requested probability is 0.6.
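The three parts of (e) amount to sums over the probability table. A small Python sketch makes this explicit. The dictionary representation and variable names are illustrative assumptions.

```python
# Probability distribution of X from the question, as value -> probability.
pmf = {1: 0.4, 3: 0.4, 4: 0.1, 6: 0.1}

# (i) P(X > 7): no value of X exceeds 6, so nothing is summed.
p_gt_7 = sum(p for x, p in pmf.items() if x > 7)

# (ii) E(X) = sum of x * P(X = x) over all values of X.
expected_x = sum(x * p for x, p in pmf.items())

# (iii) P(X^2 > 2): sum the probabilities of the values whose square exceeds 2.
p_sq_gt_2 = sum(p for x, p in pmf.items() if x ** 2 > 2)

print(p_gt_7, round(expected_x, 4), round(p_sq_gt_2, 4))
```

Part (iii) is computed directly from the distribution of X, which mirrors the shortcut noted in the commentary.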


(f) Suppose that $x_1 = -1$, $x_2 = 2$, $x_3 = 1$, $x_4 = -1$, $x_5 = 2$, and $y_1 = 1$, $y_2 = -2$, $y_3 = 1$, $y_4 = -1$, $y_5 = 2$. Calculate the following quantities:

i. $\sum_{i=1}^{3} 3x_i$

ii. $\sum_{i=3}^{5} 2(y_i - 1)$

iii. $y_4^2 + \sum_{i=3}^{4} (3x_i + y_i^3)$.

[6 marks]

Reading for this question

This question refers to the basic bookwork which can be found in Section 2.9 of the subject guide and in particular Activity A2.6.

Approaching the question

Be careful to leave the $x_i$s and $y_i$s in the order given and only cover the values of i asked for. This question was generally well done. The answers are:

i. $\sum_{i=1}^{3} 3x_i = 3(-1 + 2 + 1) = 6$.

ii. $\sum_{i=3}^{5} 2(y_i - 1) = 2 \sum_{i=3}^{5} (y_i - 1) = 2((1 - 1) + (-1 - 1) + (2 - 1)) = 2(0 - 2 + 1) = -2$.

iii. $y_4^2 + \sum_{i=3}^{4} (3x_i + y_i^3) = (-1)^2 + (3 \times 1 + 1^3) + (3 \times (-1) + (-1)^3) = 1 + 4 - 4 = 1$.
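The summations in part (f) can be checked with a few lines of Python. This is only a sketch. The helper functions `x` and `y` are hypothetical conveniences that map the question's 1-based subscripts onto Python's 0-based lists.

```python
# Values from the question, stored in the order given.
xs = [-1, 2, 1, -1, 2]
ys = [1, -2, 1, -1, 2]

def x(i):
    """Return x_i using the question's 1-based subscript."""
    return xs[i - 1]

def y(i):
    """Return y_i using the question's 1-based subscript."""
    return ys[i - 1]

# (i) sum over i = 1..3 of 3 * x_i
part_i = sum(3 * x(i) for i in range(1, 4))

# (ii) sum over i = 3..5 of 2 * (y_i - 1)
part_ii = sum(2 * (y(i) - 1) for i in range(3, 6))

# (iii) y_4 squared plus the sum over i = 3..4 of (3 * x_i + y_i cubed)
part_iii = y(4) ** 2 + sum(3 * x(i) + y(i) ** 3 for i in range(3, 5))

print(part_i, part_ii, part_iii)
```

Note that `range(a, b)` excludes `b`, so the upper limit of each sum is written as one more than the subscript in the question.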

(g) An insurance salesman visits one of two areas each day; namely A and B. The choice of which area to visit is made randomly every day with area A being chosen 40% of the time, whereas area B is chosen with probability 60%. When the salesman visits area A there is a 40% chance that he will make an insurance sale, and when he visits area B the chance of a sale is 30%.

i. What is the probability that the salesman will make a sale on a typical day?
ii. Suppose that a sale was made on a particular day. What is the probability that the salesman visited area B?

[4 marks]

Reading for this question

This is a question on probability and targets mostly the material of Chapter 5 in the subject guide. It is essential to practise on such exercises through the activities and exercises of this chapter as well as the material on the VLE. In particular you can attempt Activity A5.6 and Sample examination question 4. It is also useful to familiarise yourself with probability trees as they can be quite handy in such exercises.

Approaching the question

The first part was straightforward for those who were familiar with this section. Part (ii) requires knowledge of the conditional probability definition, although it can also be approached by common logic regarding the concept of independence. The working is given below.

i. •• We have:

\[ P(\text{Sale}) = P(\text{Sale} \mid A)\,P(A) + P(\text{Sale} \mid B)\,P(B) = 0.4 \times 0.4 + 0.3 \times 0.6 = 0.34. \]

ii. •• We have:

\[ P(B \mid \text{Sale}) = \frac{P(\text{Sale} \mid B)\,P(B)}{P(\text{Sale})} = 0.18/0.34 = 9/17 = 0.529. \]
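Both parts of (g) follow from the total probability formula and Bayes' theorem. The sketch below uses exact fractions so that 9/17 comes out without rounding. The dictionary names are illustrative assumptions.

```python
from fractions import Fraction

# Probability of visiting each area, and of a sale given the area visited.
p_area = {"A": Fraction(2, 5), "B": Fraction(3, 5)}
p_sale_given_area = {"A": Fraction(2, 5), "B": Fraction(3, 10)}

# (i) Total probability: P(Sale) = sum over areas of P(Sale | area) * P(area).
p_sale = sum(p_sale_given_area[a] * p_area[a] for a in p_area)

# (ii) Bayes' theorem: P(B | Sale) = P(Sale | B) * P(B) / P(Sale).
p_b_given_sale = p_sale_given_area["B"] * p_area["B"] / p_sale

print(p_sale, float(p_sale))
print(p_b_given_sale, round(float(p_b_given_sale), 3))
```

Working in fractions keeps the intermediate value 0.18/0.34 exact. Converting to a decimal only at the end avoids rounding errors carrying through.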


(h) State whether the following are true or false and give a brief explanation. (Note that no marks will be awarded for a simple true/false answer.)

i. A correlation coefficient of 0.9 between variables A and B suggests that there is a positive influence of variable A on variable B.
ii. The lower the constant in the regression equation, the weaker the correlation.
iii. Consider a confidence interval for a population mean, obtained by a sample, and assume that the population variance is unknown. If we add 5 to every number in the sample while keeping everything else the same, the width of the interval will not change.
iv. When testing a hypothesis, we use a one-tailed test if we want to test whether the parameter is different from what is stated in the null hypothesis.
v. A population list is needed in order to conduct stratified random sampling.
vi. In some cases, the median can be larger than the upper quartile.

[12 marks]

Reading for this question

This question contains material from various parts of the subject guide. Here, it is more important to have a good intuitive understanding of the relevant concepts than technical skill in computations. Part (i) concerns normal random variables that can be found in Chapter 6 of the subject guide. Part (ii) is about correlation and regression (see Sections 12.7 and 12.8), whereas the next part targets confidence intervals (see for example Sections 8.7 to 8.9). The next part (iv) targets the concepts of a p-value covered in Section 8.11 in the context of chi-squared test, presented in Chapter 9. Finally, part (vi) requires material from Chapter 10 and in particular Section 10.7 on types of sample.

Approaching the question

Candidates always find this type of question tricky. It requires a brief explanation of the reason for a true/false answer and not just a choice between the two. Some candidates also lost marks for long, rambling explanations that did not indicate a decision as to whether a statement was true or false.

i. False; the correlation may be spurious. Correlation does not imply causality.
ii. False; the constant in the regression equation does not affect the correlation coefficient, which is only linked with the slope.
iii. True; the width is determined by the sample size, the t or Z value and the sample variance. All of these quantities will remain the same if we add 5 to each number in the sample.
iv. False; this is done when we want to test if the parameter is greater than (or smaller than) what is stated in the null hypothesis.
v. True; a population list is needed to randomly select individuals from each stratum.
vi. False; the median is smaller than or equal to the upper quartile. Hence it can never be larger.

Section B

Answer two questions from this section (25 marks each).

Question 2

(a) The following data show the marks (in %) achieved by a random sample of students in an end-of-year examination:


62 66 41 82 99 36

78 90 48 92 95 80

88 93 66 62 47 70

63 33 63 47 61 58

50 35 96 37 60 73

i. Carefully construct a stem-and-leaf diagram of these data.
ii. Find the median and the interquartile range.
iii. Comment on the data given the shape of the stem-and-leaf diagram without any further calculations.
iv. Name two other types of graphical displays that would be suitable to represent the data.

[12 marks]

Reading for this question

Chapter 4 in the subject guide provides all the relevant material for this question. More specifically, reading on stem-and-leaf diagrams can be found in Section 4.7.4, but the entire Sections 4.7, 4.8 and 4.9 are highly relevant.

Approaching the question

i. A stem-and-leaf diagram, which was compatible with what the examiners were expecting to see, is shown below. Marks were awarded for including the title, correct labelling, vertical alignment and reasonable accuracy.

Stem-and-leaf diagram of end-of-year examination marks

Stem = 10s of % | Leaf = 1s of %

3 | 3567
4 | 1778
5 | 08
6 | 01223366
7 | 038
8 | 028
9 | 023569

ii. We have:
• Median = 63.
• Q1 ≈ 47.5 and Q3 ≈ 85.
• Interquartile range = Q3 − Q1 ≈ 37.5.

iii. Approximately symmetric, with a very slight positive skew.

iv. Any two of boxplot, dot plot and histogram.

(b) An experiment was conducted to investigate the choices made in mutual fund selection. Random samples of 125 undergraduate and 90 MBA students were presented with different funds which were identical except for fees. The data are summarised in the table below:

                  Undergraduates   MBA students
High-cost fund    34               16
Low-cost fund     91               74

i. Give a 90% confidence interval for the difference in the proportions favouring the high-cost fund between undergraduate and MBA students.
ii. Carry out an appropriate hypothesis test at two appropriate significance levels to determine whether MBA students are less likely to choose the high-cost fund compared to undergraduate students. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings.


iii. State any assumptions you made in (ii).

[13 marks]

Reading for this question

Look up the sections about hypothesis testing and confidence intervals for differences in proportions; more specifically Sections 7.13, 8.14 and 8.15 in Chapters 7 and 8 of the subject guide.

Approaching the question

i. Let $p_1$, $n_1$ refer to the proportion of high-cost-favouring undergraduate students and to the total number of undergraduate students, respectively. Similarly, denote by $p_2$ and $n_2$ the corresponding quantities for MBA students. The calculation for the confidence interval is straightforward given the formula sheet; make sure you are able to recognise the relevant formula. First, the standard error needs to be calculated:

\[ \text{s.e.}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = 0.0566. \]

Then, the lower and upper bounds can be found to be 0.0010 and 0.1874 respectively. Finally, the above should be presented as an interval (0.0010, 0.1874).

ii. Now let $\pi_1$ denote the population proportion of MBA students choosing the high-cost fund and $\pi_2$ the corresponding proportion for undergraduate students, with sample proportions $p_1 = 16/90$ and $p_2 = 34/125$ (note that this labelling is the reverse of that used in part (i)). Also denote by $p = 50/215$ the overall sample proportion of high-cost fund choices. Regarding hypotheses, note that the wording ‘less likely’ suggests a one-sided test: $H_0: \pi_1 = \pi_2$ vs. $H_1: \pi_1 < \pi_2$. The next step is to identify the test statistic, which is $(p_1 - p_2)/\text{s.e.}(p_1 - p_2)$ and follows a standard normal distribution under $H_0$, where:

\[ \text{s.e.}(p_1 - p_2) = \sqrt{p(1 - p)\left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = 0.0584. \]

Based on the above, the value of the test statistic is −1.6133. The critical value at the 5% level is −1.645, hence we do not reject $H_0$ at the 5% level. Testing at the 10% level gives a critical value of −1.282. Therefore, we reject $H_0$ at the 10% level, concluding that there is weak evidence that MBA students are less likely to choose the high-cost fund compared to undergraduate students.

iii. Any two of:
• Sample size is large enough to justify the normality assumption.
• Equal variances.
• Independent samples.

Some candidates stated assumptions in this part that were not made in part...
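The confidence interval and test statistic for Question 2(b) can be reproduced with the following Python sketch. It is an illustration rather than the examiners' method, and the variable names are assumptions. It uses the unpooled standard error for the interval and the pooled standard error for the test, as in the working above.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the table (variable names are illustrative).
n_ug, high_ug = 125, 34      # undergraduates: sample size, high-cost choices
n_mba, high_mba = 90, 16     # MBA students: sample size, high-cost choices

p_ug = high_ug / n_ug
p_mba = high_mba / n_mba
std_normal = NormalDist()

# (i) 90% confidence interval for p_ug - p_mba, using the unpooled standard error.
se_unpooled = sqrt(p_ug * (1 - p_ug) / n_ug + p_mba * (1 - p_mba) / n_mba)
z_90 = std_normal.inv_cdf(0.95)          # two-sided 90% interval
diff = p_ug - p_mba
ci = (diff - z_90 * se_unpooled, diff + z_90 * se_unpooled)

# (ii) One-sided test of H0: equal proportions vs. H1: MBA proportion is lower,
# using the pooled proportion in the standard error.
p_pooled = (high_ug + high_mba) / (n_ug + n_mba)
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n_ug + 1 / n_mba))
z = (p_mba - p_ug) / se_pooled
reject = {alpha: z < std_normal.inv_cdf(alpha) for alpha in (0.05, 0.10)}

print(f"s.e. (unpooled) = {se_unpooled:.4f}, 90% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"s.e. (pooled) = {se_pooled:.4f}, z = {z:.4f}, reject H0: {reject}")
```

Values computed this way may differ from the commentary in the last decimal place, since the commentary appears to round intermediate quantities.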

