Title | Chapter 5 - Bivariate Analysis (students' notes) |
---|---|
Course | Statistics For Science And Engineering |
Institution | Universiti Teknologi MARA |
STA408: Statistics for Science and Engineering
Chapter 5: Bivariate Analysis

Example 1:
Educators are interested in determining whether the number of hours a student studies is related to the student's score on a particular exam. Medical researchers are interested in questions such as: Is caffeine related to heart damage? Is there a relationship between a person's age and his or her blood pressure? A zoologist may want to know whether the birth weight of a certain animal is related to its life span. In an industrial situation, an engineer may want to know whether the tar content in the outlet stream of a chemical process is related to the inlet temperature. These are only a few of the many questions that can be answered by using the techniques of correlation and regression analysis.

Definitions
Correlation: A statistical method used to determine whether a relationship between variables exists.
Regression: A statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

At the end of this chapter, we should be able to answer these questions statistically:
Are the two variables related?
If so, what is the strength of the relationship?
What type of relationship exists?
What kind of predictions can be made from the relationship?

To answer the first two questions, statisticians use a correlation coefficient, i.e., a numerical measure used to determine whether the two variables are related and to determine the strength of the relationship between or among the variables.

To answer the third question, note that there are two types of relationships, simple and multiple, but we will only consider simple relationships here. In a simple relationship, there are two variables: an independent variable, also called an explanatory variable or a predictor variable, and a dependent variable, also called a response variable. The analysis of a simple relationship is called simple regression, in which one independent variable is used to predict the dependent variable.
Simple relationships can also be positive or negative. A positive relationship exists when both variables increase or decrease at the same time. In a negative relationship, as one variable increases, the other variable decreases, and vice versa.

Finally, the fourth question asks what type of predictions can be made. Predictions are made daily in all areas. Examples include weather forecasting, stock market analyses, sales predictions, crop predictions, gasoline price predictions, and sports predictions. Some predictions are more accurate than others, depending on the strength of the relationship. That is, the stronger the relationship between the variables, the more accurate the prediction.
5.1 Linear Correlation Coefficient (Pearson Product Moment Correlation Coefficient)
Linear correlation measures the strength of the linear association between two variables. The correlation coefficient calculated for the population data is denoted by 𝜌. The correlation coefficient calculated for the sample data is denoted by 𝑟. The value of the correlation coefficient always lies between −1 and 1, i.e., −1 ≤ 𝜌 ≤ 1 and −1 ≤ 𝑟 ≤ 1.
Linear correlation between two variables
𝑟=1 Perfect positive linear correlation
𝑟 = −1 Perfect negative linear correlation
𝑟=0 No linear correlation
Formula to calculate Linear Correlation Coefficient
The simple linear correlation coefficient, denoted by 𝑟, measures the strength of the linear relationship between two variables for a sample and is calculated as
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$
where
$$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad \text{and} \qquad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}.$$
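The computational formulas for the sums of squares and for 𝑟 translate directly into a few lines of code. The following is a minimal sketch only: the function names `sums_of_squares` and `pearson_r` and the small data set are illustrative assumptions, not part of the notes.

```python
import math


def sums_of_squares(x, y):
    """Return (SS_xy, SS_xx, SS_yy) using the computational formulas."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    return ss_xy, ss_xx, ss_yy


def pearson_r(x, y):
    """Sample linear correlation coefficient r = SS_xy / sqrt(SS_xx * SS_yy)."""
    ss_xy, ss_xx, ss_yy = sums_of_squares(x, y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical data for illustration only (not taken from the notes).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
print(f"r = {r:.4f}")
```

A value of 𝑟 close to 1 or −1 indicates a strong linear relationship, while a value close to 0 indicates little or no linear relationship.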
Assumptions for the Correlation Coefficient
The sample is a random sample.
The data pairs fall approximately on a straight line and are measured at the interval or ratio level.
The variables have a joint normal distribution, i.e., given any 𝑥 value, the 𝑦 values are normally distributed, and vice versa.

Example 2:
The data shown below are the number of cars owned and the revenue received for car rental companies in Malaysia for a recent year, while Figure 1 displays the scatter plot for the data. Compute the correlation coefficient for the data and interpret the value.

Company | Cars (in ten thousands) | Revenue (in billions) |
---|---|---|
A | 63.0 | RM 7.0 |
B | 29.0 | RM 3.9 |
C | 20.8 | RM 2.1 |
D | 19.1 | RM 2.8 |
E | 13.4 | RM 1.4 |
F | 8.5 | RM 1.5 |
[Scatter plot "Car Rental Companies": Revenue (billions) on the vertical axis against Car (in 10,000s) on the horizontal axis.]
Figure 1: Scatter plot for revenue (in billions) vs car (in ten thousands).
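One possible worked solution for Example 2 is sketched below; it is offered as a check rather than as the official answer. The sums are computed from the table with 𝑛 = 6 and substituted into the formula for 𝑟 from Section 5.1.

```latex
\begin{aligned}
n &= 6,\quad \sum x = 153.8,\quad \sum y = 18.7,\quad \sum x^2 = 5859.26,\quad \sum y^2 = 80.67,\quad \sum xy = 682.77 \\
SS_{xy} &= 682.77 - \frac{(153.8)(18.7)}{6} \approx 203.4267 \\
SS_{xx} &= 5859.26 - \frac{(153.8)^2}{6} \approx 1916.8533 \\
SS_{yy} &= 80.67 - \frac{(18.7)^2}{6} \approx 22.3883 \\
r &= \frac{203.4267}{\sqrt{(1916.8533)(22.3883)}} \approx 0.982
\end{aligned}
```

Since 𝑟 ≈ 0.982 is close to 1, there appears to be a strong positive linear relationship between the number of cars owned and the revenue received.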
Example 3:
The data below were obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class, and the scatter diagram is given in Figure 2. Compute the correlation coefficient for the data and interpret the value.

Student | Number of absences, 𝒙 | Final grade, 𝒚 (%) |
---|---|---|
A | 6 | 82 |
B | 2 | 86 |
C | 15 | 43 |
D | 9 | 74 |
E | 12 | 58 |
F | 5 | 90 |
G | 8 | 78 |
[Scatter diagram "Students from a Statistic Class": Final grade (%) on the vertical axis against Number of absences on the horizontal axis.]
Figure 2: Scatter diagram of number of absences versus final grades of statistics students.
Example 4:
A researcher wishes to see if there is a relationship between the number of hours that nine people exercise each week and the amount of milk each person consumes per week. The data are as shown and the scatter plot is presented in Figure 3. Compute the correlation coefficient for the data and interpret the value.

Person | Hours, 𝒙 | Amount, 𝒚 |
---|---|---|
A | 3 | 48 |
B | 0 | 8 |
C | 2 | 32 |
D | 5 | 64 |
E | 8 | 10 |
F | 5 | 32 |
G | 10 | 56 |
H | 2 | 72 |
I | 1 | 48 |
[Scatter plot "Exercise and Milk Consumption": Amount on the vertical axis against Hours on the horizontal axis.]
Figure 3: Scatter plot for exercise and milk consumption.
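A possible worked solution for Example 4 is sketched below, again as a check rather than the official answer. The sums are computed from the nine data pairs and substituted into the formula for 𝑟 from Section 5.1.

```latex
\begin{aligned}
n &= 9,\quad \sum x = 36,\quad \sum y = 370,\quad \sum x^2 = 232,\quad \sum y^2 = 19236,\quad \sum xy = 1520 \\
SS_{xy} &= 1520 - \frac{(36)(370)}{9} = 40 \\
SS_{xx} &= 232 - \frac{(36)^2}{9} = 88 \\
SS_{yy} &= 19236 - \frac{(370)^2}{9} \approx 4024.8889 \\
r &= \frac{40}{\sqrt{(88)(4024.8889)}} \approx 0.067
\end{aligned}
```

Since 𝑟 ≈ 0.067 is close to 0, there appears to be a very weak positive linear relationship, if any, between hours of exercise and milk consumption.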
5.2 Simple Linear Regression
A simple regression model includes only two variables, i.e., one independent and one dependent. The dependent variable is the one being explained while the independent variable is the one used to explain the variation in the dependent variable. A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model. The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the data.
Line of the Best Fit
The figure above shows that several lines can be drawn on the graph near the points. As such, given a scatter plot, we must be able to draw the line of best fit.
Best fit means that the sum of the squares of the vertical distances from each point to the line is a minimum.
Equation of a linear relationship
The equation of a linear relationship between two variables 𝑥 and 𝑦 is given as
𝑦 = 𝑎 + 𝑏𝑥
For example, if 𝑎 = 50 and 𝑏 = 5, then 𝑦 = 50 + 5𝑥 and the plot of this linear equation is illustrated in Figure 4.
Figure 4: The plot of the linear equation 𝑦 = 50 + 5𝑥 .
In Figure 4,
The line intersects the 𝑦-axis at 50, which is the value of 𝑦 when 𝑥 = 0. This value is called the 𝑦-intercept and is given by the constant term.
In the equation 𝑦 = 50 + 5𝑥, the value 5 is the coefficient of 𝑥 or the slope of the line. It gives the amount of the change in 𝑦 due to a change of one unit in 𝑥. For example,
If 𝑥 = 10, then 𝑦 = 50 + 5(10) = 100.
If 𝑥 = 11, then 𝑦 = 50 + 5(11) = 105.
Hence, as 𝑥 increases by 1 unit, 𝑦 increases by 5 units. This is true for any value of 𝑥 and such changes in 𝑥 and 𝑦 are shown in Figure 5.
Figure 5: The 𝑦-intercept and slope of a linear regression model.

Simple Linear Regression Model
In a regression model, the independent variable is usually denoted by 𝑥 and the dependent variable is usually denoted by 𝑦. Hence, a simple linear regression model is written as
𝑦 = 𝑎 + 𝑏𝑥
where 𝑎 is the constant term or 𝑦-intercept and 𝑏 is the slope.
In general, 𝑎 gives the value of 𝑦 when 𝑥 = 0 while 𝑏 gives the change in 𝑦 due to a change of one unit in 𝑥.

The least squares line
In regression analysis, we try to find a line that best fits the points in the scatter diagram. Such a line provides the best possible description of the relationship between the dependent and independent variables. The least squares method gives such a line, and the line obtained by using the least squares method is called the least squares regression line. For the least squares regression line 𝑦 = 𝑎 + 𝑏𝑥,
$$b = \frac{SS_{xy}}{SS_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x}$$
where
$$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}.$$
The least squares regression line 𝑦 = 𝑎 + 𝑏𝑥 is also called the regression of 𝑦 on 𝑥.
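The least squares formulas, and the sense in which the resulting line is the "best fit", can be illustrated with a short sketch. The function names `least_squares_line` and `sum_squared_errors` and the small data set are assumptions made for illustration; they do not come from the notes. The loop at the end nudges the intercept and slope away from the least squares values to show that the sum of squared vertical distances only gets larger.

```python
def least_squares_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    if ss_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b = ss_xy / ss_xx                  # slope
    a = sum_y / n - b * sum_x / n      # intercept: y-bar minus b times x-bar
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data for illustration only (not taken from the notes).
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 70]

a, b = least_squares_line(x, y)
best = sum_squared_errors(x, y, a, b)
print(f"least squares line: y = {a:.2f} + {b:.2f}x, SSE = {best:.3f}")

# Any other intercept or slope gives a larger sum of squared distances.
others = []
for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    sse = sum_squared_errors(x, y, a + da, b + db)
    others.append(sse)
    print(f"intercept {da:+.1f}, slope {db:+.1f}: SSE = {sse:.3f}")
```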
Example 5:
Refer to the data given in Example 2, where the data shown below are the number of cars owned and the revenue received for car rental companies in Malaysia for a recent year.

Company | Cars (in ten thousands) | Revenue (in billions) |
---|---|---|
A | 63.0 | RM 7.0 |
B | 29.0 | RM 3.9 |
C | 20.8 | RM 2.1 |
D | 19.1 | RM 2.8 |
E | 13.4 | RM 1.4 |
F | 8.5 | RM 1.5 |

(a) Find the equation of the regression line.
(b) Interpret the slope of the regression line.
(c) Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles.
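A possible worked solution for Example 5 is sketched below, as a check rather than the official answer. It uses the sums for this data set (∑𝑥 = 153.8, ∑𝑦 = 18.7, 𝑛 = 6), with SS𝑥𝑦 ≈ 203.4267 and SS𝑥𝑥 ≈ 1916.8533 computed from the table. The intercept is calculated with the unrounded slope 0.106125, and 200,000 automobiles corresponds to 𝑥 = 20 because 𝑥 is measured in ten thousands.

```latex
\begin{aligned}
b &= \frac{SS_{xy}}{SS_{xx}} = \frac{203.4267}{1916.8533} \approx 0.1061 \\
a &= \bar{y} - b\bar{x} = \frac{18.7}{6} - (0.106125)\left(\frac{153.8}{6}\right) \approx 0.3963 \\
y &= 0.3963 + 0.1061x \\
x = 20:\quad y &= 0.3963 + 0.1061(20) \approx 2.52
\end{aligned}
```

Under this fit, each additional ten thousand cars is associated with an increase of about RM 0.1061 billion in revenue on average, and an agency with 200,000 automobiles is predicted to receive roughly RM 2.52 billion.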
Coefficient of Determination
The coefficient of determination, denoted by 𝑅², represents the proportion of variation in the dependent variable, 𝑦, that can be explained by the independent variable, 𝑥. The computational formula for 𝑅² is
$$R^2 = \frac{b\,SS_{xy}}{SS_{yy}}$$
and 0 ≤ 𝑅² ≤ 1.
Note: The calculation of 𝑅² using the formula is not required in the exam. The value can be obtained from the Minitab output or by squaring the value of the correlation coefficient, 𝑟.

Example 6:
Calculate the coefficient of determination for the data given in Example 2 and interpret the value.
∑𝑥 = 153.8, ∑𝑥² = 5859.26, ∑𝑦 = 18.7, ∑𝑦² = 80.67, ∑𝑥𝑦 = 682.77.
SS𝑥𝑦 = 203.4267, SS𝑦𝑦 = 22.3883, 𝑏 = 0.1061
$$R^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(0.1061)(203.4267)}{22.3883} \approx 0.964$$
Interpretation: About 96.4% of the variation in revenue can be explained by the number of cars owned. The remaining 3.6% or so is due to other factors.
The following figure shows the least squares regression line found in Example 5.
[Fitted Line Plot: Revenue (billions) = 0.3963 + 0.1061 Car (in 10,000s), with Revenue (billions) on the vertical axis and Car (in 10,000s) on the horizontal axis. S = 0.447106, R-Sq = 96.4%, R-Sq(adj) = 95.5%.]
Test for Adequacy of a Linear Model – Analysis of Variance (ANOVA) Approach
The ANOVA table for regression analysis is given as follows.

ANOVA table for Testing the Adequacy of a Linear Model
Source of variation | Degrees of Freedom | Sum of Squares | Mean Square | Test statistic, 𝑭 |
---|---|---|---|---|
Regression | 1 | SSR | SSR | 𝐹 = SSR / 𝑠² |
Error | 𝑛 − 2 | SSE | 𝑠² = SSE / (𝑛 − 2) | |
Total | 𝑛 − 1 | SST | | |

where SST = SS𝑦𝑦, SSR = 𝑏 SS𝑥𝑦 and SSE = SST − SSR.

Example 7:
Refer to the data in Example 5 and construct the ANOVA table for Testing the Adequacy of a Linear Model.
∑𝑥 = 153.8, ∑𝑦 = 18.7, ∑𝑥² = 5859.26, ∑𝑦² = 80.67, ∑𝑥𝑦 = 682.77.
𝑏 = SS𝑥𝑦 / SS𝑥𝑥 ≈ 0.1061
SSR = 𝑏 SS𝑥𝑦 ≈ 21.5887 (using the unrounded slope)
$$SST = SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 80.67 - \frac{(18.7)^2}{6} \approx 22.3883$$

Source of variation | Degrees of Freedom | Sum of Squares | Mean Square | Test statistic, 𝑭 |
---|---|---|---|---|
Regression | 1 | 21.5887 | 21.5887 | 𝐹 = 21.5887 / 0.1999 ≈ 108.00 |
Error | 4 | 0.7996 | 𝑠² = 0.7996 / 4 ≈ 0.1999 | |
Total | 5 | 22.3883 | | |
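The ANOVA quantities can also be computed directly from the raw data. The sketch below is illustrative only: the function name `regression_anova` and the dictionary keys are assumptions, not notation from the notes. It follows the definitions SST = SS𝑦𝑦, SSR = 𝑏 SS𝑥𝑦, SSE = SST − SSR, 𝑠² = SSE / (𝑛 − 2) and 𝐹 = SSR / 𝑠², applied to the car rental data.

```python
def regression_anova(x, y):
    """ANOVA quantities for the least squares regression of y on x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    b = ss_xy / ss_xx
    ssr = b * ss_xy        # regression sum of squares
    sst = ss_yy            # total sum of squares
    sse = sst - ssr        # error sum of squares
    s2 = sse / (n - 2)     # mean square error
    return {
        "df_regression": 1,
        "df_error": n - 2,
        "df_total": n - 1,
        "SSR": ssr,
        "SSE": sse,
        "SST": sst,
        "s2": s2,
        "F": ssr / s2,
    }


# Car rental data from Example 2 / Example 5.
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

table = regression_anova(cars, revenue)
print(f"Regression: df = {table['df_regression']}, SS = {table['SSR']:.4f}")
print(f"Error:      df = {table['df_error']}, SS = {table['SSE']:.4f}, "
      f"s^2 = {table['s2']:.4f}")
print(f"Total:      df = {table['df_total']}, SS = {table['SST']:.4f}")
print(f"F = {table['F']:.2f}")
```

The results should agree, up to rounding, with the Analysis of Variance section of the Minitab output for this data set.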
Below is the Minitab output for Example 5.

Regression Analysis: Revenue (billions) versus Car (in 10,000s)

Analysis of Variance
Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 21.5887 | 21.5887 | 108.00 | 0.000 |
Car (in 10,000s) | 1 | 21.5887 | 21.5887 | 108.00 | 0.000 |
Error | 4 | 0.7996 | 0.1999 | | |
Total | 5 | 22.3883 | | | |

Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
0.447106 | 96.43% | 95.54% | 91.67% |

Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 0.396 | 0.319 | 1.24 | 0.282 | |
Car (in 10,000s) | 0.1061 | 0.0102 | 10.39 | 0.000 | 1.00 |

Regression Equation
Revenue (billions) = 0.396 + 0.1061 Car (in 10,000s)
Example 8:
Refer to the ANOVA table in Example 7. Test at the 5% level of significance whether the linear regression model is significant.
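A possible outline of the test for Example 8 is sketched below, as a check rather than the official answer. The values of SSR and 𝑠² are those reported in the Minitab output for Example 5, and the critical value is an assumption in the sense that it is read from a standard 𝐹 table with 1 and 4 degrees of freedom at the 5% level.

```latex
\begin{aligned}
H_0 &: \text{the linear regression model is not significant} \\
H_1 &: \text{the linear regression model is significant} \\
F &= \frac{SSR}{s^2} = \frac{21.5887}{0.1999} \approx 108.00 \\
F_{0.05;\,1,\,4} &= 7.71 \\
108.00 &> 7.71 \;\Rightarrow\; \text{reject } H_0
\end{aligned}
```

At the 5% level of significance, there appears to be enough evidence to conclude that the linear regression model is significant. The same conclusion follows from the Minitab P-value of 0.000, which is less than 0.05.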
Example 9:
Refer to the data in Example 3, where the data were obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class.

Student | Number of absences, 𝒙 | Final grade, 𝒚 (%) |
---|---|---|
A | 6 | 82 |
B | 2 | 86 |
C | 15 | 43 |
D | 9 | 74 |
E | 12 | 58 |
F | 5 | 90 |
G | 8 | 78 |
The Minitab output for the data is as shown below.

Regression Analysis: Final grade (%) versus Number of absences

Analysis of Variance
Source | DF | Adj SS | Adj MS | F-Value | P-Value |
---|---|---|---|---|---|
Regression | 1 | 1506.7 | 1506.71 | 41.10 | 0.001 |
Number of absences | 1 | 1506.7 | 1506.71 | 41.10 | 0.001 |
Error | 5 | 183.3 | 36.66 | | |
Total | 6 | 1690.0 | | | |

Model Summary
S | R-sq | R-sq(adj) | R-sq(pred) |
---|---|---|---|
6.05464 | 89.15% | 86.99% | 67.62% |

Coefficients
Term | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 102.49 | 5.14 | 19.95 | 0.000 | |
Number of absences | -3.622 | 0.565 | -6.41 | 0.001 | 1.00 |
Based on the Minitab output, answer the following questions.
(a) State the independent and dependent variables.
(b) Write down the regression equation.
(c) Show by calculation that the slope value is −3.622 and interpret its value in the context of the problem.
(d) Determine the coefficient of correlation.
(e) State the coefficient of determination and explain its meaning.
(f) Based on the regression equation, estimate the final grade (%) of a student who was absent from class seven times in a semester.
(g) Perform a test to determine whether the linear regression model is significant. Use the 5% level of significance.
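The numerical parts of Example 9 can be checked against the raw data with a short script. This is a sketch under stated assumptions, not the official solution: the variable names are illustrative, and the critical value 6.61 is read from a standard 𝐹 table with 1 and 5 degrees of freedom at the 5% level rather than computed. Here the number of absences is the independent variable 𝑥 and the final grade is the dependent variable 𝑦.

```python
import math

# Data from Example 3 / Example 9.
absences = [6, 2, 15, 9, 12, 5, 8]        # independent variable, x
grades = [82, 86, 43, 74, 58, 90, 78]     # dependent variable, y (%)
n = len(absences)

sum_x, sum_y = sum(absences), sum(grades)
ss_xy = sum(x * y for x, y in zip(absences, grades)) - sum_x * sum_y / n
ss_xx = sum(x ** 2 for x in absences) - sum_x ** 2 / n
ss_yy = sum(y ** 2 for y in grades) - sum_y ** 2 / n

# (b), (c): slope and intercept of the least squares line y = a + b*x.
b = ss_xy / ss_xx
a = sum_y / n - b * sum_x / n

# (d), (e): correlation coefficient and coefficient of determination.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared = r ** 2

# (f): estimated final grade for a student with seven absences.
predicted = a + b * 7

# (g): F statistic from the ANOVA decomposition.
ssr = b * ss_xy
sse = ss_yy - ssr
f_stat = ssr / (sse / (n - 2))
F_CRITICAL = 6.61  # assumption: F(0.05; 1, 5) taken from a standard F table

print(f"regression line: y = {a:.2f} + ({b:.3f})x")
print(f"r = {r:.3f}, R^2 = {r_squared:.4f}")
print(f"predicted grade at 7 absences: {predicted:.2f}%")
print(f"F = {f_stat:.2f}, reject H0 at 5%: {f_stat > F_CRITICAL}")
```

The slope is negative, so 𝑟 is the negative square root of 𝑅²; each additional absence is associated with a drop of about 3.6 percentage points in the final grade on average.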