Tutorial 2 - (with Answers) PDF

Title	Tutorial 2 - (with Answers)
Author	Nuzhat Khan
Course	Business Data Analytics
Institution	McMaster University
Pages	4

Description

Tutorial 2

Question 1 Consider the following data. Suppose Q1 and Q3 are calculated to be 123 and 126.6, respectively. Determine whether any of the following data points is an outlier.

126, 128, 126.6, 123, 121.7, 123, 123, 125.7, 128.9, 117.7

Answer: To determine whether there are any outliers, we first find the upper and lower fences:

IQR = Q3 - Q1 = 126.6 - 123 = 3.6
Upper fence = Q3 + 1.5 × IQR = 126.6 + 1.5 × 3.6 = 132
Lower fence = Q1 - 1.5 × IQR = 123 - 1.5 × 3.6 = 117.6

The smallest value (117.7) and the largest value (128.9) both lie between the fences. All the points therefore fall within the lower and upper fences, and we do not have any outliers.
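The fence calculation can be checked with a few lines of Python. This is a minimal sketch: the helper name `tukey_fences` and its parameter `k` are illustrative choices, and Q1 and Q3 are taken as given in the question rather than recomputed from the data.

```python
# Data and quartiles exactly as stated in Question 1.
# Q1 and Q3 are taken as given; they are not recomputed here.
data = [126, 128, 126.6, 123, 121.7, 123, 123, 125.7, 128.9, 117.7]
q1, q3 = 123, 126.6


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(q1, q3)

# A point is flagged as an outlier if it lies outside the fences.
outliers = [x for x in data if x < lower or x > upper]

print(f"lower fence = {lower:.1f}, upper fence = {upper:.1f}")
print("outliers:", outliers)
```

Note that 117.7 sits only 0.1 above the lower fence, so a slightly different Q1 or Q3 could change the conclusion.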

Question 2 Jane, a financial analyst at an investment firm, is conducting research on 250 companies within the energy sector listed on the TSX. She is looking at different measures for these companies, including revenue, number of employees, etc. For the 250 companies that Jane is researching, the average revenue is $869 million with a standard deviation of $62 million (normally distributed), and the average number of employees is 45,000 with a standard deviation of 4,900 (normally distributed). Suppose Jane randomly selects companies from the list of 250 to research. She would like to know which event is more likely to occur as she does her research: finding a company that has a revenue of $1.2 billion (or more), or finding a company that has 64,000 employees (or more).

Answer: To answer this question, Jane needs to find the standardized values (z-scores). A z-score is interpreted as the distance between a data point and the mean, measured in number of standard deviations. Data points that are further away from the mean are less likely to occur in random samples than data points that are closer to the mean.

z-score = (value - mean) / standard deviation

Revenue case (in $ millions): z = (1200 - 869) / 62 = 5.34
Employee case (in thousands): z = (64 - 45) / 4.9 = 3.88

Therefore, since the company with an employee count of 64,000 (company A) is 3.88 standard deviations above the mean while the company with revenue of $1.2 billion (company B) is 5.34 standard deviations above the mean, Jane is more likely to come across a company like company A in her research than a company like company B.
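The comparison can also be sketched in Python. The z-score function follows the formula in the answer. The upper-tail probability is an addition that the tutorial does not compute: it uses the standard identity P(Z ≥ z) = 0.5 · erfc(z / √2) for a standard normal variable, and the function names are illustrative.

```python
import math


def z_score(value, mean, sd):
    """Distance of value from the mean, in standard deviations."""
    return (value - mean) / sd


def upper_tail(z):
    """P(Z >= z) for a standard normal variable (not part of the tutorial answer)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Revenue in $ millions; employees as head counts.
z_revenue = z_score(1200, 869, 62)
z_employees = z_score(64_000, 45_000, 4_900)

print(f"revenue:   z = {z_revenue:.2f}, P(Z >= z) = {upper_tail(z_revenue):.2e}")
print(f"employees: z = {z_employees:.2f}, P(Z >= z) = {upper_tail(z_employees):.2e}")

# The event with the smaller z-score has the larger upper-tail probability.
more_likely = "employees" if z_employees < z_revenue else "revenue"
print("more likely event:", more_likely)
```

Both events are rare under the stated normal model; the z-scores only tell us which of the two is less rare.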

Question 3 In the sample of companies discussed in Question 2, indicate the following: a) The range that contains approximately the middle 68% of all companies when it comes to revenue. b) The range that contains approximately the middle 95% of all companies when it comes to the number of employees. c) The range that contains approximately the middle 99.73% (almost 100%) of all companies when it comes to the number of employees.

Answer: Reminder: The empirical rule is used here. In later chapters, once we start working with the Standard Normal Table, we will need to calculate the exact values for z; the values presented here are approximations. Because the data are normally distributed, approximately the middle 68% of the data fall in the range $\bar{x} \pm s$, the middle 95% of all data points fall in the interval $\bar{x} \pm 2s$, and the middle 99.73% (almost 100%) of all observations fall in the interval $\bar{x} \pm 3s$. See the picture below.

a) For revenue, the average and standard deviation of all 250 companies are $869 million and $62 million, respectively. Therefore, the interval that approximately contains the middle 68% of all observations when it comes to revenue is $\bar{x} \pm s$ = 869 ± 62 = [$807, $931] million.

b) For the number of employees, the average and standard deviation of all 250 companies are 45,000 and 4,900, respectively. Therefore, the interval that approximately contains the middle 95% of all observations when it comes to number of employees is $\bar{x} \pm 2s$ = 45,000 ± 2(4,900) = [35,200, 54,800].

c) The interval that approximately contains the middle 99.73% (almost 100%) of all observations when it comes to number of employees is $\bar{x} \pm 3s$ = 45,000 ± 3(4,900) = [30,300, 59,700].
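The three intervals can be reproduced with a short Python sketch. The helper names are illustrative. The `coverage` function is an addition: it uses the identity P(|Z| ≤ k) = erf(k / √2) to show where the 68%, 95% and 99.73% figures of the empirical rule come from.

```python
import math


def empirical_interval(mean, sd, k):
    """Interval mean ± k*sd, returned as (low, high)."""
    return mean - k * sd, mean + k * sd


def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


revenue_68 = empirical_interval(869, 62, 1)           # $ millions
employees_95 = empirical_interval(45_000, 4_900, 2)   # head count
employees_9973 = empirical_interval(45_000, 4_900, 3)

print("a)", revenue_68)
print("b)", employees_95)
print("c)", employees_9973)

for k in (1, 2, 3):
    print(f"within {k} sd: {coverage(k):.4f}")
```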

Question 4 Consider the following sample data {16, 12, 33, 28, 32, 25, 27, 14, 14, 22}: a) Calculate the 52nd percentile. b) Calculate the 70th percentile.

Answers: To calculate percentiles, we first need to sort the sample data in ascending order. This gives {12, 14, 14, 16, 22, 25, 27, 28, 32, 33}.

a) To calculate the 52nd percentile, we multiply 0.52 by the number of observations (10): 0.52 × 10 = 5.2. Since 5.2 is not an integer, we round up to 6 and take the value in the 6th position. The 52nd percentile of this data set is 25.

b) To calculate the 70th percentile, we multiply 0.70 by 10: 0.70 × 10 = 7. Since 7 is an integer, we take the average of the values in the 7th and 8th positions. The 70th percentile is then (27 + 28) / 2 = 27.5.
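The positional rule used in this answer can be sketched in Python as follows. The function name `tutorial_percentile` is hypothetical, and the sketch assumes `p` is a whole-number percentile with 0 < p ≤ 100. Library routines such as `numpy.percentile` interpolate by default, so they may return different values from this rule.

```python
def tutorial_percentile(data, p):
    """p-th percentile by the rule in the answer above.

    Compute i = (p / 100) * n on the sorted data (1-based positions):
      - if i is not an integer, round up and take the value at that position;
      - if i is an integer, average the values at positions i and i + 1.
    Assumes p is an integer with 0 < p <= 100.
    """
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    values = sorted(data)
    n = len(values)
    scaled = p * n  # integer arithmetic avoids floating-point rounding
    if scaled % 100 != 0:
        position = scaled // 100 + 1  # round up
        return values[position - 1]
    position = scaled // 100
    if position == n:  # no (n + 1)-th value to average with
        return values[-1]
    return (values[position - 1] + values[position]) / 2


sample = [16, 12, 33, 28, 32, 25, 27, 14, 14, 22]
print("52nd percentile:", tutorial_percentile(sample, 52))
print("70th percentile:", tutorial_percentile(sample, 70))
```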

Question 5 The following is a set of data from a sample of size n=9:

X     Y
7     21
8     24
3     9
6     18
12    36
4     12
9     27
15    45
18    54

a) Compute the coefficient of correlation, r. b) How strong is the relationship between X and Y? Explain.

Answers: a) We can calculate r by using the following expression:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

First, we need to find the sample means of the X and Y values:

$$\bar{x} = \frac{\sum x}{n} = \frac{82}{9} = 9.11 \qquad \text{and} \qquad \bar{y} = \frac{\sum y}{n} = \frac{246}{9} = 27.33$$

Now we can insert the mean values into the equation and calculate the difference and squared-distance terms shown in the original formula given above:

$$r = \frac{\sum (x - 9.11)(y - 27.33)}{\sqrt{\sum (x - 9.11)^2 \sum (y - 27.33)^2}}$$

So we have,

$$r = \frac{(7 - 9.11)(21 - 27.33) + (8 - 9.11)(24 - 27.33) + \cdots + (18 - 9.11)(54 - 27.33)}{\sqrt{[(7 - 9.11)^2 + (8 - 9.11)^2 + \cdots + (18 - 9.11)^2][(21 - 27.33)^2 + (24 - 27.33)^2 + \cdots + (54 - 27.33)^2]}}$$

Use each data point and write the summations in open form:

$$r = \frac{(-2.11)(-6.33) + (-1.11)(-3.33) + \cdots + (8.89)(26.67)}{\sqrt{[4.45 + 1.23 + \cdots + 79.01][40.11 + 11.11 + \cdots + 711.11]}} = \frac{602.67}{\sqrt{363207}} = \frac{602.67}{602.67}$$

$$r = +1$$

b) The correlation coefficient measures the direction and strength of a linear relationship. It varies between -1 and +1. For the given problem, we obtain the maximum possible positive linear relationship: there is a perfect positive linear relationship between X and Y. That means all the points lie exactly on a line.

Note: a correlation coefficient of 1 is not very common. Because real data almost always contain noise, it is unlikely that a correlation coefficient will be calculated to be exactly 1 or -1. The purpose of this example was to demonstrate how the formula for the correlation coefficient can be used, and what two different variables with perfect correlation could look like....
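The calculation can be verified with a short Python sketch that applies the formula for r directly to the nine (X, Y) pairs from the question. The function name `pearson_r` is illustrative. In this data set every Y value is exactly three times its X value, which is why the points lie on a line.

```python
import math

# The nine (X, Y) pairs from Question 5.
x = [7, 8, 3, 6, 12, 4, 9, 15, 18]
y = [21, 24, 9, 18, 36, 12, 27, 45, 54]


def pearson_r(xs, ys):
    """Sample correlation coefficient, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
print(f"sum of x = {sum(x)}, sum of y = {sum(y)}")
print(f"r = {r:.4f}")
```

Working with unrounded means, as the code does, avoids the small discrepancies that appear when 9.11 and 27.33 are used in hand calculation.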