Chapter 5 - Bivariate Analysis (students' notes)

Course	Statistics For Science And Engineering
Institution	Universiti Teknologi MARA

STA408: Statistics for Science and Engineering

Chapter 5: Bivariate Analysis

Example 1:
• Educators are interested in determining whether the number of hours a student studies is related to the student’s score on a particular exam.
• Medical researchers are interested in questions such as: Is caffeine related to heart damage? Is there a relationship between a person’s age and his or her blood pressure?
• A zoologist may want to know whether the birth weight of a certain animal is related to its life span.
• In an industrial situation, an engineer may want to know if the tar content in the outlet stream in a chemical process is related to the inlet temperature.

These are only a few of the many questions that can be answered by using the techniques of correlation and regression analysis.

Definitions
• Correlation: a statistical method used to determine whether a relationship between variables exists.
• Regression: a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

At the end of this chapter, we should be able to answer these questions statistically:
• Are the two variables related?
• If so, what is the strength of the relationship?
• What type of relationship exists?
• What kind of predictions can be made from the relationship?

To answer the first two questions, statisticians use a correlation coefficient, i.e., a numerical measure of whether the two variables are related and of how strong the relationship between them is.

To answer the third question, note that there are two types of relationships, simple and multiple, but we will only consider simple relationships here. In a simple relationship, there are two variables:
• an independent variable, also called an explanatory variable or a predictor variable, and
• a dependent variable, also called a response variable.

The analysis of a simple relationship is called simple regression: one independent variable is used to predict the dependent variable. Simple relationships can also be positive or negative.
• A positive relationship exists when both variables increase or decrease at the same time.
• A negative relationship exists when, as one variable increases, the other variable decreases, and vice versa.

Finally, the fourth question asks what type of predictions can be made. Predictions are made daily in all areas. Examples include weather forecasting, stock market analyses, sales predictions, crop predictions, gasoline price predictions, and sports predictions. Some predictions are more accurate than others, depending on the strength of the relationship: the stronger the relationship between the variables, the more accurate the prediction.

5.1 Linear Correlation Coefficient (Pearson Product Moment Correlation Coefficient)

Linear correlation measures the strength of the linear association between two variables. The correlation coefficient calculated for the population data is denoted by $\rho$. The correlation coefficient calculated for the sample data is denoted by $r$. The value of the correlation coefficient always lies between −1 and 1, i.e., $-1 \le \rho \le 1$ and $-1 \le r \le 1$.

Linear correlation between two variables:
• $r = 1$: perfect positive linear correlation
• $r = -1$: perfect negative linear correlation
• $r = 0$: no linear correlation

Formula to calculate Linear Correlation Coefficient

The simple linear correlation coefficient, denoted by $r$, measures the strength of the linear relationship between two variables for a sample and is calculated as

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

where

$$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad \text{and} \qquad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}.$$
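
The formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pearson_r` and the small data sets are made up for the demonstration and are not part of the notes. Points lying exactly on an increasing line should give a value of 1, and points on a decreasing line a value of −1.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation coefficient r = SS_xy / sqrt(SS_xx * SS_yy)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    return ss_xy / sqrt(ss_xx * ss_yy)


# Made-up illustrative data (not from the notes)
x_demo = [1, 2, 3, 4, 5]
r_up = pearson_r(x_demo, [3, 5, 7, 9, 11])    # points on y = 1 + 2x
r_down = pearson_r(x_demo, [10, 8, 6, 4, 2])  # points on y = 12 - 2x
print(round(r_up, 4), round(r_down, 4))
```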

Assumptions for the Correlation Coefficient
• The sample is a random sample.
• The data pairs fall approximately on a straight line and are measured at the interval or ratio level.
• The variables have a joint normal distribution, i.e., given any $x$ value, the $y$ values are normally distributed, and vice versa.

Example 2: The data shown below are the number of cars owned and the revenue received for car rental companies in Malaysia for a recent year, while Figure 1 displays the scatter plot for the data. Compute the correlation coefficient for the data and interpret the value.

Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      RM 7.0
B         29.0                      RM 3.9
C         20.8                      RM 2.1
D         19.1                      RM 2.8
E         13.4                      RM 1.4
F          8.5                      RM 1.5

Figure 1: Scatter plot for revenue (in billions) vs car (in ten thousands).
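
One possible working for Example 2 is sketched below. It uses the sums that the notes quote for these data in Example 6 ($\sum x = 153.8$, $\sum y = 18.7$, $\sum x^2 = 5859.26$, $\sum y^2 = 80.67$, $\sum xy = 682.77$, with $n = 6$). The rounding to four decimal places is a choice made here.

```latex
\begin{align*}
SS_{xy} &= 682.77 - \frac{(153.8)(18.7)}{6} = 203.4267 \\
SS_{xx} &= 5859.26 - \frac{(153.8)^2}{6} = 1916.8533 \\
SS_{yy} &= 80.67 - \frac{(18.7)^2}{6} = 22.3883 \\
r &= \frac{203.4267}{\sqrt{(1916.8533)(22.3883)}} \approx 0.982
\end{align*}
```

Since $r$ is close to 1, this suggests a strong positive linear relationship between the number of cars and the revenue.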

Example 3: The data below were obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class, and the scatter diagram is given in Figure 2. Compute the correlation coefficient for the data and interpret the value.

Student   Number of absences, $x$   Final grade, $y$ (%)
A          6                        82
B          2                        86
C         15                        43
D          9                        74
E         12                        58
F          5                        90
G          8                        78

Figure 2: Scatter diagram of number of absences versus final grades of statistics students.
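
As a hedged sketch of how Example 3 might be worked, the Python below builds the column totals of the usual calculation table ($\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$) and then applies the formula for $r$. The variable names are illustrative choices, not notation from the notes.

```python
from math import sqrt

# Data from Example 3
absences = [6, 2, 15, 9, 12, 5, 8]      # x
grades = [82, 86, 43, 74, 58, 90, 78]   # y (%)
n = len(absences)

# Column totals of the calculation table
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)
sum_y2 = sum(y * y for y in grades)

# Sums of squares and the correlation coefficient
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
r = ss_xy / sqrt(ss_xx * ss_yy)

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2}")
print(f"SS_xy = {ss_xy:.4f}, SS_xx = {ss_xx:.4f}, SS_yy = {ss_yy:.4f}")
print(f"r = {r:.3f}")
```

The value of $r$ comes out close to −1, which points to a strong negative linear relationship: students with more absences tend to have lower final grades.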

Example 4: A researcher wishes to see if there is a relationship between the number of hours that nine people exercise each week and the amount of milk each person consumes per week. The data are as shown and the scatter plot is presented in Figure 3. Compute the correlation coefficient for the data and interpret the value.

Person   Hours, $x$   Amount, $y$
A         3           48
B         0            8
C         2           32
D         5           64
E         8           10
F         5           32
G        10           56
H         2           72
I         1           48

Figure 3: Scatter plot for exercise and milk consumption.
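
A possible working for Example 4 is sketched below. The sums are computed here from the table ($n = 9$) and are not quoted in the notes, so they are worth re-checking by hand.

```latex
\begin{align*}
\sum x &= 36, \quad \sum y = 370, \quad \sum xy = 1520, \quad \sum x^2 = 232, \quad \sum y^2 = 19236 \\
SS_{xy} &= 1520 - \frac{(36)(370)}{9} = 40 \\
SS_{xx} &= 232 - \frac{(36)^2}{9} = 88 \\
SS_{yy} &= 19236 - \frac{(370)^2}{9} \approx 4024.89 \\
r &= \frac{40}{\sqrt{(88)(4024.89)}} \approx 0.067
\end{align*}
```

A value this close to 0 suggests a very weak positive linear relationship between hours of exercise and milk consumption.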

5.2 Simple Linear Regression

A simple regression model includes only two variables, i.e., one independent and one dependent. The dependent variable is the one being explained, while the independent variable is the one used to explain the variation in the dependent variable. A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model. The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the data.

Line of the Best Fit

The figure above shows that several lines can be drawn on the graph near the points. As such, given a scatter plot, we must be able to draw the line of best fit.

Best fit means that the sum of the squares of the vertical distances from each point to the line is a minimum.
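
To illustrate the idea of best fit, the sketch below compares the sum of squared vertical distances for three lines of the form y = a + b·x drawn through the Example 2 data. The first line uses the coefficients of the fitted line reported later in the notes (0.3963 and 0.1061). The other two lines are hypothetical alternatives invented here purely for comparison.

```python
# Data from Example 2: cars (in ten thousands) and revenue (RM billions)
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def sum_squared_vertical_distances(a, b, x, y):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# (intercept, slope) pairs; lines 1 and 2 are made up for comparison
candidate_lines = {
    "fitted line from the notes": (0.3963, 0.1061),
    "hypothetical line 1": (0.0, 0.12),
    "hypothetical line 2": (1.0, 0.09),
}

sse = {
    name: sum_squared_vertical_distances(a, b, cars, revenue)
    for name, (a, b) in candidate_lines.items()
}
for name, value in sse.items():
    print(f"{name}: {value:.4f}")

best = min(sse, key=sse.get)
print("smallest sum of squares:", best)
```

Among these three candidates, the fitted line gives the smallest sum of squares, which is what the definition of best fit requires.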

Equation of a linear relationship

The equation of a linear relationship between two variables $x$ and $y$ is given as

$$y = a + bx$$

For example, if $a = 50$ and $b = 5$, then $y = 50 + 5x$, and the plot of this linear equation is illustrated in Figure 4.

Figure 4: The plot of the linear equation $y = 50 + 5x$.

In Figure 4,
• The line intersects the $y$-axis at 50, which is the value of $y$ when $x = 0$, called the $y$-intercept and is given by the constant term.
• In the equation $y = 50 + 5x$, the value 5 is the coefficient of $x$ or the slope of the line. It gives the amount of the change in $y$ due to a change of one unit in $x$.

For example,
• If $x = 10$, then $y = 50 + 5(10) = 100$.
• If $x = 11$, then $y = 50 + 5(11) = 105$.

Hence, as $x$ increases by 1 unit, $y$ increases by 5 units. This is true for any value of $x$, and such changes in $x$ and $y$ are shown in Figure 5.

Figure 5: The $y$-intercept and slope of a linear regression model.

Simple Linear Regression Model

In a regression model, the independent variable is usually denoted by $x$ and the dependent variable by $y$. Hence, a simple linear regression model is written as

$$y = a + bx$$

where $a$ is the constant term or $y$-intercept and $b$ is the slope. In general, $a$ gives the value of $y$ when $x = 0$, while $b$ gives the change in $y$ due to a change of one unit in $x$.

The least squares line

In regression analysis, we try to find a line that best fits the points in the scatter diagram. Such a line provides the best possible description of the relationship between the dependent and independent variables. The least squares method gives such a line, and the line obtained by using the least squares method is called the least squares regression line. For the least squares regression line $y = a + bx$,

$$b = \frac{SS_{xy}}{SS_{xx}} \qquad \text{and} \qquad a = \bar{y} - b\bar{x},$$

where

$$SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} \qquad \text{and} \qquad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}.$$

The least squares regression line $y = a + bx$ is also called the regression of $y$ on $x$.
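
For readers who want to see where the least squares formulas come from, a brief derivation is sketched below. The notation $S(a, b)$ for the sum of squared vertical distances is introduced here and is not used in the notes; $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$ denote the sample means.

```latex
\begin{align*}
S(a, b) &= \sum (y - a - bx)^2 \\
\frac{\partial S}{\partial a} = -2 \sum (y - a - bx) = 0
  &\;\Rightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial S}{\partial b} = -2 \sum x\,(y - a - bx) = 0
  &\;\Rightarrow\; \sum xy - a \sum x - b \sum x^2 = 0 \\
\text{substituting } a:\quad
\sum xy - \frac{\sum x \sum y}{n}
  &= b \left( \sum x^2 - \frac{(\sum x)^2}{n} \right)
  \;\Rightarrow\; b = \frac{SS_{xy}}{SS_{xx}}
\end{align*}
```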

Example 5: Refer to the data given in Example 2, where the data shown below are the number of cars owned and the revenue received for car rental companies in Malaysia for a recent year.

Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      RM 7.0
B         29.0                      RM 3.9
C         20.8                      RM 2.1
D         19.1                      RM 2.8
E         13.4                      RM 1.4
F          8.5                      RM 1.5

(a) Find the equation of the regression line.
(b) Interpret the slope of the regression line.
(c) Use the equation of the regression line to predict the revenue of a car rental agency that has 200,000 automobiles.
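
A minimal sketch of how Example 5 could be checked numerically is given below, using the least squares formulas from this section. The variable names are illustrative. Note that 200,000 automobiles correspond to x = 20, because x is measured in ten thousands.

```python
# Data from Example 5: cars (in ten thousands) and revenue (RM billions)
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]
n = len(cars)

mean_x = sum(cars) / n
mean_y = sum(revenue) / n
ss_xy = sum(x * y for x, y in zip(cars, revenue)) - sum(cars) * sum(revenue) / n
ss_xx = sum(x * x for x in cars) - sum(cars) ** 2 / n

# (a) slope and intercept of the least squares line y = a + b*x
b = ss_xy / ss_xx
a = mean_y - b * mean_x
print(f"(a) y = {a:.4f} + {b:.4f} x")

# (b) b is the estimated change in revenue (RM billions)
#     for each additional ten thousand cars
print(f"(b) slope = {b:.4f}")

# (c) 200,000 automobiles = 20 ten-thousands
x_new = 200_000 / 10_000
predicted_revenue = a + b * x_new
print(f"(c) predicted revenue = RM {predicted_revenue:.3f} billion")
```

The coefficients agree with the fitted line shown in the notes for these data, Revenue = 0.3963 + 0.1061 Car. The slope can be read as: for every additional ten thousand cars, revenue increases by about RM 0.1061 billion on average.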

Coefficient of Determination

The coefficient of determination, denoted by $R^2$, represents the proportion of variation in the dependent variable, $y$, that can be explained by the independent variable, $x$. The computational formula for $R^2$ is

$$R^2 = \frac{b\,SS_{xy}}{SS_{yy}}$$

and $0 \le R^2 \le 1$.

Note: The calculation of $R^2$ using the formula is not required in the exam. The value can be obtained from the Minitab output or by squaring the value of the correlation coefficient, $r$.

Example 6: Calculate the coefficient of determination for the data given in Example 2 and interpret the value.

$$\sum x = 153.8, \quad \sum x^2 = 5859.26, \quad \sum y = 18.7, \quad \sum y^2 = 80.67, \quad \sum xy = 682.77.$$

$$SS_{xy} = 203.4267, \qquad b = 0.1061, \qquad SS_{yy} = 22.3883$$

$$R^2 = \frac{b\,SS_{xy}}{SS_{yy}} =$$

Interpretation:
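
One way to complete Example 6 is sketched below, using only the values quoted above. Rounding to three decimal places is a choice made here.

```latex
\begin{align*}
R^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(0.1061)(203.4267)}{22.3883} \approx 0.964
\end{align*}
```

A possible interpretation: about 96.4% of the variation in revenue can be explained by the number of cars, and the remaining 3.6% or so is due to other factors.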

The following figure shows the least squares regression line found in Example 5.

Fitted Line Plot: Revenue (billions) = 0.3963 + 0.1061 Car (in 10,000s), with S = 0.447106, R-Sq = 96.4% and R-Sq(adj) = 95.5%.

Test for Adequacy of a Linear Model – Analysis of Variance (ANOVA) Approach

The ANOVA table for regression analysis is given as follows.

ANOVA table for Testing the Adequacy of a Linear Model

Source of variation   Degrees of Freedom   Sum of Squares   Mean Square                   Test statistic, $F$
Regression            1                    SSR              SSR                           $F = \dfrac{SSR}{s^2}$
Error                 $n - 2$              SSE              $s^2 = \dfrac{SSE}{n - 2}$
Total                 $n - 1$              SST

where $SST = SS_{yy}$, $SSR = b\,SS_{xy}$ and $SSE = SST - SSR$.

Example 7: Refer to the data in Example 5 and construct the ANOVA table for testing the adequacy of a linear model.

$$\sum x = 153.8, \quad \sum y = 18.7, \quad \sum x^2 = 5859.26, \quad \sum y^2 = 80.67, \quad \sum xy = 682.77.$$

$$b =$$

$$SSR = b\,SS_{xy} =$$

$$SST = SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} =$$

Source of variation   Degrees of Freedom   Sum of Squares   Mean Square   Test statistic, $F$
Regression                                                                $F =$
Error                                                       $s^2 =$
Total
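
The ANOVA quantities for Example 7 can be checked with a short Python sketch such as the one below. It starts from the sums quoted in the example with n = 6; the variable names and the printed layout are illustrative choices.

```python
# Summary values quoted in Example 7 (n = 6 companies)
n = 6
sum_x, sum_y = 153.8, 18.7
sum_x2, sum_y2, sum_xy = 5859.26, 80.67, 682.77

ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
b = ss_xy / ss_xx

sst = ss_yy          # total sum of squares
ssr = b * ss_xy      # regression sum of squares
sse = sst - ssr      # error sum of squares
msr = ssr / 1        # regression mean square (1 degree of freedom)
s2 = sse / (n - 2)   # error mean square
f_stat = msr / s2    # test statistic F

rows = [
    ("Regression", 1, ssr, msr, f_stat),
    ("Error", n - 2, sse, s2, None),
    ("Total", n - 1, sst, None, None),
]
print(f"{'Source':<12}{'DF':>4}{'SS':>10}{'MS':>10}{'F':>9}")
for source, df, ss, ms, f in rows:
    ms_text = f"{ms:10.4f}" if ms is not None else " " * 10
    f_text = f"{f:9.2f}" if f is not None else ""
    print(f"{source:<12}{df:>4}{ss:10.4f}{ms_text}{f_text}")
```

The results should agree, up to rounding, with the Minitab output for these data (SSR = 21.5887, SSE = 0.7996, SST = 22.3883, F = 108.00).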

Below is the Minitab output for Example 5.

Regression Analysis: Revenue (billions) versus Car (in 10,000s)

Analysis of Variance
Source               DF   Adj SS    Adj MS    F-Value   P-Value
Regression            1   21.5887   21.5887   108.00    0.000
  Car (in 10,000s)    1   21.5887   21.5887   108.00    0.000
Error                 4    0.7996    0.1999
Total                 5   22.3883

Model Summary
S          R-sq     R-sq(adj)   R-sq(pred)
0.447106   96.43%   95.54%      91.67%

Coefficients
Term               Coef     SE Coef   T-Value   P-Value   VIF
Constant           0.396    0.319      1.24     0.282
Car (in 10,000s)   0.1061   0.0102    10.39     0.000     1.00

Regression Equation
Revenue (billions) = 0.396 + 0.1061 Car (in 10,000s)

Example 8: Refer to the ANOVA table in Example 7. Test at 5% level of significance if the linear regression model is significant.
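
A possible outline of the test in Example 8 is sketched below. The wording of the hypotheses is an assumption and may differ from the convention used in class. The critical value $F_{0.05;\,1,\,4} = 7.71$ is taken from a standard F table with 1 and $n - 2 = 4$ degrees of freedom, and the sums of squares are those in the Minitab output for Example 5.

```latex
\begin{align*}
H_0 &: \text{the linear regression model is not significant} \\
H_1 &: \text{the linear regression model is significant} \\
F &= \frac{SSR}{s^2} = \frac{21.5887}{0.1999} \approx 108.00 \\
F_{0.05;\,1,\,4} &= 7.71 \\
F = 108.00 &> 7.71 \;\Rightarrow\; \text{reject } H_0
\end{align*}
```

At the 5% level of significance there is sufficient evidence that the linear regression model is significant. The same conclusion follows from the Minitab P-value of 0.000, which is less than 0.05.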

Example 9: Refer to the data in Example 3, where the data were obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class.

Student   Number of absences, $x$   Final grade, $y$ (%)
A          6                        82
B          2                        86
C         15                        43
D          9                        74
E         12                        58
F          5                        90
G          8                        78

The Minitab output for the data is as shown below.

Regression Analysis: Final grade (%) versus Number of absences

Analysis of Variance
Source                 DF   Adj SS   Adj MS    F-Value   P-Value
Regression              1   1506.7   1506.71   41.10     0.001
  Number of absences    1   1506.7   1506.71   41.10     0.001
Error                   5    183.3     36.66
Total                   6   1690.0

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
6.05464   89.15%   86.99%      67.62%

Coefficients
Term                 Coef     SE Coef   T-Value   P-Value   VIF
Constant             102.49   5.14      19.95     0.000
Number of absences   -3.622   0.565     -6.41     0.001     1.00

Based on the Minitab output, answer the following questions.
(a) State the independent and dependent variable.
(b) Write down the regression equation.
(c) Show by calculation that the slope value is −3.622 and interpret its value in the context of the problem.
(d) Determine the coefficient of correlation.
(e) State the coefficient of determination and explain its meaning.
(f) Based on the regression equation, estimate the final grade (%) of a student who was absent from class seven times in a semester.
(g) Perform a test to determine whether the linear regression model is significant. Use 5% level of significance.
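
The numerical parts of Example 9 can be checked with a sketch like the one below. It recomputes the slope from the raw data for part (c) and otherwise uses values read from the Minitab output. The variable names are illustrative, and the critical value F(0.05; 1, 5) = 6.61 is taken from a standard F table rather than from the notes.

```python
from math import sqrt

# Data from Example 9 (same as Example 3)
absences = [6, 2, 15, 9, 12, 5, 8]      # x
grades = [82, 86, 43, 74, 58, 90, 78]   # y (%)
n = len(absences)

# (c) slope from the least squares formula b = SS_xy / SS_xx
ss_xy = sum(x * y for x, y in zip(absences, grades)) - sum(absences) * sum(grades) / n
ss_xx = sum(x * x for x in absences) - sum(absences) ** 2 / n
slope = ss_xy / ss_xx

# Values read from the Minitab output
intercept, slope_minitab = 102.49, -3.622
r_sq = 0.8915     # R-sq = 89.15%
f_value = 41.10

# (d) r is the square root of R-sq, with the same sign as the slope
r = -sqrt(r_sq) if slope_minitab < 0 else sqrt(r_sq)

# (f) estimated final grade for a student with 7 absences
predicted_grade = intercept + slope_minitab * 7

# (g) compare F with the table value F(0.05; 1, 5) = 6.61
f_critical = 6.61
model_is_significant = f_value > f_critical

print(f"(c) slope = {slope:.3f}")
print(f"(d) r = {r:.3f}")
print(f"(f) predicted grade = {predicted_grade:.2f}%")
print(f"(g) F = {f_value} > {f_critical}: {model_is_significant}")
```

For part (c), the slope can be interpreted as follows: each additional absence is associated with a decrease of about 3.622 percentage points in the final grade, on average.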
