Module 3 - Notes PDF

Title Module 3 - Notes
Course Introduction to Biostatistics
Institution University of Saskatchewan
Pages 13
File Size 661.3 KB
File Type PDF


Statistics 246 – Module 3 Notes

Module 3: Scatterplots and Correlation

Bivariate Data

- For each individual studied, we record data on two variables. We then examine whether there is a relationship between these two variables:
  o Do changes in one variable tend to be associated with specific changes in the other variable?
  o Here we have two quantitative variables recorded for each of 16 students:
    ▪ First, how many beers they drank
    ▪ Second, their resulting blood alcohol content (BAC)

Student ID   Number of Beers   Blood Alcohol Content
1            5                 0.1
2            2                 0.03
3            9                 0.19
4            8                 0.12
5            3                 0.04
6            7                 0.095
7            3                 0.07
8            5                 0.06
9            3                 0.02
10           5                 0.05
11           4                 0.07
12           6                 0.1
13           5                 0.085
14           7                 0.09
15           1                 0.01
16           4                 0.05

Scatterplots

- A scatterplot is used to display quantitative bivariate data.
  o It shows the relationship between two quantitative variables measured on the same individuals.
- Each variable makes up one axis.

  o Each individual is a point on the graph.
    ▪ Each of the 16 students in the beers and BAC table contributes one point: (number of beers, BAC).
    ▪ The student ID number is not a variable.

Explanatory and response variables

- A response (dependent) variable measures an outcome of a study.
- An explanatory (independent) variable may explain or influence changes in a response variable.
  o For example, for the alcohol consumption data, we are looking at the effect of the number of beers on blood alcohol content.
    ▪ The response is the resulting blood alcohol content, and we want to see if we can explain it by the number of beers drunk.
  o When there is an obvious explanatory variable, it is plotted on the x (horizontal) axis of the scatterplot.

[Figure: scatterplot "Blood Alcohol as a function of Number of Beers". x-axis (explanatory): Number of Beers, 0 to 10. y-axis (response): Blood Alcohol Level (mg/ml), 0.00 to 0.20.]



Interpreting scatterplots

- After plotting two variables on a scatterplot, we describe the overall pattern of the relationship. Specifically, we look for:
  o Form: linear, curved, clusters, or no pattern
    ▪ [Figure: three example scatterplots labelled "Linear", "Non-linear or curved", and "No relationship".]
  o Direction:
    ▪ Positive: high values of one variable tend to occur together with high values of the other variable.
    ▪ Negative: high values of one variable tend to occur together with low values of the other variable.
    ▪ No direction
  o Strength: how closely the points fit the overall form
    ▪ The strength of the relationship between two quantitative variables refers to how much variation, or scatter, there is around the main form.
    ▪ In the two example plots, the first shows a stronger negative linear relationship than the second.
  o Outliers of the relationship: clear deviations from the overall pattern
    ▪ Recall that an outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected).
    ▪ In a scatterplot, outliers are points that fall outside the overall pattern of the relationship.
    ▪ In the example plot, the highlighted point does not quite fit the overall negative linear pattern.
- Example: a reasonably strong, negative, linear relationship between the log of child mortality and the log of GDP across the world.
  o Each point represents one country.
  o We notice one outlier in the upper right corner, but overall the pattern is very clearly linear with or without the outlier.





[Figure: scatterplot of manatee deaths from powerboat collisions (y-axis, 0 to 100) against powerboats registered (x 1,000) (x-axis, 400 to 1000).]

- Example: a very strong, positive, linear relationship between manatee deaths from collisions with powerboats and the number of powerboats registered.
  o No outliers.

- Example: a mild, negative, linear relationship between the Gesell adaptive score (a measure of intelligence in young children) and the age at which a child first spoke.
  o The relationship is not so obvious here, because there are very few data points for late age at first word and because of an outlier at the top.

Adding categorical variables to scatterplots

- Two or more relationships can be compared on a single scatterplot when we use different symbols for groups of points on the graph.
  o Example: energy expended as a function of running speed for various treadmill inclines. Describe this relationship.
    ▪ For each incline, there is a very strong, positive, linear relationship between energy expenditure and speed.
    ▪ In addition, the relationship between energy expenditure and speed is noticeably different for different inclines.
    ▪ More energy tends to be expended for a given running speed if the incline is steeper (uphill).

The correlation coefficient: r

- The correlation coefficient is a measure of the direction and strength of a relationship.
  o It is calculated using the mean and the standard deviation of both the x and y variables:

    $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

    ▪ $s_x$ = SD of the x variable
    ▪ $s_y$ = SD of the y variable
- Correlation can only be used to describe quantitative variables.
  o Categorical variables do not have means and standard deviations.
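As a rough illustration of the formula for r, the following Python sketch computes r from standardized values for the 16-student beers and BAC data in these notes. The function names (`mean`, `sample_sd`, `correlation`) are our own choices, not part of the course material, and the sketch assumes the SD is the sample standard deviation (dividing by n - 1).

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (assumed: n - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

# Beers and BAC for students 1 to 16, in student ID order.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

r_beers = correlation(beers, bac)
print(round(r_beers, 3))
```

For these data r comes out at about 0.89, which matches the strong, positive, linear pattern seen in the scatterplot.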

- Example (time to swim and pulse rate):
  o Time to swim: $\bar{x} = 35$, $s_x = 0.7$
  o Pulse rate: $\bar{y} = 140$, $s_y = 9.5$
  o In the plot, the vertical bar is the error bar for pulse rate and the horizontal bar is the error bar for time.

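r doesn't distinguish explanatory and response variables

- r treats x and y symmetrically:

  $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

  o For the first graph: r = -0.75
  o For the second graph (x and y axes swapped): r = -0.75
- Here time to swim is the explanatory variable and belongs on the x axis.
  o However, in either plot, r is the same.

r has no unit

- Each factor in the formula is a standardized value, which is unitless:

  $$r = \frac{1}{n-1} \sum_{i=1}^{n} \underbrace{\left(\frac{x_i - \bar{x}}{s_x}\right)}_{\text{standardized value of } x} \underbrace{\left(\frac{y_i - \bar{y}}{s_y}\right)}_{\text{standardized value of } y}$$

- Changing the units of the variables does not change the correlation coefficient r, because we get rid of all units when we standardize.
  o In both versions of the plot: r = -0.75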
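The swim-time and pulse-rate data are not listed in these notes, so the sketch below checks both properties on the beers and BAC data instead. It is a minimal illustration, not the course's own method: the `correlation` helper is our own, and the factor of 1000 is a hypothetical unit change chosen only for demonstration.

```python
from math import sqrt

def correlation(x, y):
    """Pearson r via the sum of products of standardized values."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    s_y = sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum((a - x_bar) / s_x * (b - y_bar) / s_y
               for a, b in zip(x, y)) / (n - 1)

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

# Swapping the roles of x and y leaves r unchanged.
r_xy = correlation(beers, bac)
r_yx = correlation(bac, beers)

# Hypothetical unit change: express BAC on a scale 1000 times larger.
bac_rescaled = [1000 * v for v in bac]
r_rescaled = correlation(beers, bac_rescaled)

print(round(r_xy, 3), round(r_yx, 3), round(r_rescaled, 3))
```

All three values agree up to floating-point rounding, because the rescaling factor cancels when each value is standardized.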



r ranges from -1 to +1

- Strength is indicated by the absolute value of r.
  o The closer r is to zero, the weaker the linear relationship; the closer r is to -1 or +1, the stronger it is.
- Direction is indicated by the sign of r (+ or -).
  o r is positive for positive linear relationships and negative for negative linear relationships.
- r has this meaning for linear relationships only.
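A small sketch with made-up data (all values below are hypothetical, chosen only for illustration) shows the two extremes and why r should be read only for linear patterns: a perfectly curved relationship can still give r = 0. The `correlation` helper is our own.

```python
from math import sqrt

def correlation(x, y):
    """Pearson r via the sum of products of standardized values."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    s_y = sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum((a - x_bar) / s_x * (b - y_bar) / s_y
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical x values, symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]

r_up = correlation(xs, [2 * v + 1 for v in xs])   # perfect increasing line
r_down = correlation(xs, [-v for v in xs])        # perfect decreasing line
r_curve = correlation(xs, [v ** 2 for v in xs])   # perfect parabola

print(round(r_up, 6), round(r_down, 6), round(abs(r_curve), 6))
```

The parabola is a perfectly predictable relationship, yet its correlation is zero (up to rounding), because r only measures how well the points follow a straight line.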



r is not resistant to outliers

- Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.
  o In the example plots, moving just one point away from the linear pattern weakens the correlation from r = -0.91 to r = -0.75.
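The data behind the plots are not given in these notes, so the following sketch uses a small hypothetical data set (invented purely for illustration) to show the same effect: moving a single point away from a negative linear pattern noticeably weakens r. The `correlation` helper is our own.

```python
from math import sqrt

def correlation(x, y):
    """Pearson r via the sum of products of standardized values."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    s_y = sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum((a - x_bar) / s_x * (b - y_bar) / s_y
               for a, b in zip(x, y)) / (n - 1)

# Hypothetical data following a tight downward linear pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [8.0, 7.5, 6.8, 6.1, 5.0, 4.2, 3.1, 2.0]

# Same data, but the last point is moved far above the pattern.
y_moved = y[:-1] + [8.0]

r_original = correlation(x, y)
r_moved = correlation(x, y_moved)

print(round(r_original, 2), round(r_moved, 2))
```

Only one of the eight points changed, yet the correlation moves much closer to zero, because that point shifts the mean and inflates the standard deviation of y.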

Example (calculation of r)

- Because elderly people may have difficulty standing straight to have their height measured, a study looked at the relationship between overall height and height to the knee.
  o Here are the data (in cm) for five elderly men:

    Knee height, x      57.7    47.4    43.5    44.8    55.2
    Overall height, y   192.1   153.3   146.4   162.7   169.1



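  o Find r:

    $$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

  o r can also be expressed as

    $$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}$$

    ▪ This formula is easier to use for computation, but both give the same result.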

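  o The required sums are:

    $$\sum_{i=1}^{5} x_i = 57.7 + 47.4 + \cdots + 55.2 = 248.6$$

    $$\sum_{i=1}^{5} x_i^2 = (57.7)^2 + (47.4)^2 + \cdots + (55.2)^2 = 12522.38$$

    $$\sum_{i=1}^{5} y_i = 823.6$$

    $$\sum_{i=1}^{5} y_i^2 = 136902.4$$

    $$\sum_{i=1}^{5} x_i y_i = (57.7)(192.1) + (47.4)(153.3) + \cdots + (55.2)(169.1) = 41342.27$$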

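  o Substituting into the computational formula:

    $$r = \frac{\sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sqrt{\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]\left[\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right]}} = \frac{41342.27 - \frac{(248.6)(823.6)}{5}}{\sqrt{\left[12522.38 - \frac{(248.6)^2}{5}\right]\left[136902.4 - \frac{(823.6)^2}{5}\right]}} = \frac{392.878}{\sqrt{200697.9}} = 0.877$$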



  o Conclusion: there is a strong, positive, linear relationship between knee height and overall height....
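The hand calculation for the knee-height example can be checked with a short Python sketch of the computational formula. The variable names are our own; the data are the five knee-height and overall-height pairs from the example.

```python
from math import sqrt

# Knee height (x) and overall height (y), in cm, for five elderly men.
knee = [57.7, 47.4, 43.5, 44.8, 55.2]
height = [192.1, 153.3, 146.4, 162.7, 169.1]
n = len(knee)

# The five sums used by the computational formula.
sum_x = sum(knee)
sum_y = sum(height)
sum_x2 = sum(v ** 2 for v in knee)
sum_y2 = sum(v ** 2 for v in height)
sum_xy = sum(a * b for a, b in zip(knee, height))

# r = [Sxy - Sx*Sy/n] / sqrt([Sx2 - Sx^2/n] * [Sy2 - Sy^2/n])
numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print(round(sum_x, 1), round(sum_y, 1), round(sum_xy, 2))
print(round(numerator, 3), round(r, 3))
```

The sums, the numerator of 392.878, and the final value r = 0.877 agree with the worked example.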

