Correlation Analysis - Lecture notes 2
Course: Agriculture (Veterinary Statistics)
Institution: CSK Himachal Pradesh Agricultural University

CORRELATION ANALYSIS

Two or more variables are said to be correlated if a change in one is accompanied by a change in the other(s). Correlation analysis involves two or more variables. Examples:
1. Rainfall and crop production.
2. Price of a commodity and its demand.
3. Age and height / age and body weight of an individual.
4. Feeding level and milk production of a cow.

Correlation analysis: Correlation analysis measures the degree of relationship between two variables.

Correlation coefficient: The measure of correlation/association is the correlation coefficient (or correlation index).

It measures the degree and direction of the relationship among variables.

Definitions of correlation: Correlation is defined in many ways:
1. "Correlation analysis deals with the association between two or more variables." (Simpson and Kafka)
2. "Correlation analysis determines the degree of relationship between variables." (Ya Lun Chou)
3. "Correlation is the analysis of co-variation between two or more variables." (A. M. Tuttle)
4. "If two or more quantities vary so that movements in one are accompanied by corresponding movements in the other(s), then they are said to be correlated." (L. R. Conner)
5. "When the relationship is of a quantitative nature, the statistical tool for measuring and expressing this relationship in a formula is called correlation." (Croxton and Cowden)

Thus, correlation is a statistical measure of co-variation between two or more variables.

Analyzing the relationship or correlation between series/variables involves the following three steps:
1. Determine whether a relationship exists, and then measure it.
2. Test its significance.
3. Establish a cause-and-effect relationship, if any. Example: that smoking causes lung cancer cannot be proved merely by a simultaneous increase in smoking and in lung cancer cases.

Significance of correlation analysis:
1. It measures the degree of relationship between variables in a single figure; most variables show some kind of relationship. Example: price and supply, income and expenditure.
2. If two variables are related, the value of one variable can be estimated from the other (by regression analysis).

3. It helps in understanding economic behaviour and trends, and in identifying important variables on which others depend. Example: inflation, price index.

Causes of correlation: Correlation analysis determines the degree of relationship between two or more variables, but it says nothing about cause and effect; it establishes only co-variation. A significant correlation can be due to one, or a combination, of the following reasons:

1. Purely by chance, especially in small samples; in a large universe there will be no relationship by chance. Example:

   Income (Rs.)   20,000   25,000   30,000
   Weight (kg)        60       70       80

   This is also termed "non-sense correlation": a correlation between two variables having no relevance or justification, arising purely by chance.

2. Both correlated variables are influenced by one or more common variables: the same cause affects each variable, or different causes affect each with the same effect. Example: high yields of both rice and tea related to high rainfall (but neither is the cause of the other).

3. The two variables mutually influence each other, so it is not possible to identify cause and effect. Example: demand and supply; price and production.

Types of correlation: Correlation is classified in different ways:
1. Positive and negative correlation (direction of correlation)
2. Simple, partial and multiple correlation (number of variables involved)
3. Linear and non-linear correlation (constancy of the ratio of change)

(1) Positive and negative correlation:
(A) Positive correlation: if both variables change in the same direction, i.e. an increase in one variable on average leads to an increase in the other, or a decrease in one is on average accompanied by a decrease in the other, the correlation is positive.
(B) Negative correlation: if the variables vary in opposite directions, i.e. an increase in one leads to a decrease in the other or vice versa, the correlation is negative.

(2) Simple, partial and multiple correlation: the distinction is based on the number of variables studied:
(A) Simple correlation: only two variables are involved.
(B) Multiple correlation: three or more variables are studied together. Example: yield of rice per acre in relation to both the amount of rainfall and the amount of fertilizer applied.
(C) Partial correlation: more than two variables are involved, but only two are considered as influencing each other; the effect of the other variables is kept constant.

(3) Linear and non-linear (curvilinear) correlation: the distinction is based on whether the ratio of change between the variables is constant.

(A) Linear correlation: if the amount of change in one variable bears a constant ratio to the change in the other variable, the correlation is linear. Example:

   Variable X   10   20   30   40   50
   Variable Y   70  140  210  280  350

The ratio of change is constant, and if the pairs are plotted on a graph, the points, when joined, form a straight line: a change in the independent variable leads to a constant change in the dependent variable.

(B) Non-linear or curvilinear correlation: if the amount of change in one variable does not bear a constant ratio to the change in the other variable, the correlation is non-linear (curvilinear). Example: doubling the amount of rainfall will not necessarily double the rice yield. In practice, many relationships are non-linear.
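The perfectly linear table above can be checked numerically: when Y changes in a constant ratio to X, the correlation coefficient works out to exactly +1. A minimal sketch in Python (the language is my choice, not the notes'), using the deviation formula r = Σxy/√(Σx²·Σy²) introduced later in these notes:

```python
# Data from the linear-correlation table: Y = 7X, constant ratio of change.
X = [10, 20, 30, 40, 50]
Y = [70, 140, 210, 280, 350]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
x = [xi - mean_x for xi in X]   # deviations x = X - X̄
y = [yi - mean_y for yi in Y]   # deviations y = Y - Ȳ

num = sum(a * b for a, b in zip(x, y))                        # Σxy
den = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5  # √(Σx²·Σy²)
r = num / den
print(r)  # → 1.0 (perfect positive, linear correlation)
```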

Methods to study correlation: The methods used to study whether two variables are correlated are:

Based on diagrams and graphs:
1. Scatter diagram method
2. Graphic method

Mathematical methods:
3. Karl Pearson's coefficient of correlation
4. Concurrent deviation method
5. Method of least squares (rarely used in practice)

1. Scatter diagram method: The given data are plotted on graph paper as dots, one for each pair of X and Y values, giving as many points as there are observations. The scatter of the points indicates the correlation: the greater the scatter of the plotted points on the chart, the smaller the relationship between the two variables.

The more closely the points approach a straight line running from the lower left-hand corner to the upper right-hand corner, the closer the correlation is to perfectly positive (r = +1). If all points lie on a straight line from the upper left-hand corner to the lower right-hand corner, the correlation is perfectly negative (r = -1). If the plotted points fall in a narrow band, correlation is high: positive if the band rises from the lower left-hand corner, negative if it declines towards the lower right. If the points are widely scattered over the diagram, there is very little correlation.

Merits of the scatter diagram method:
1. A simple, non-mathematical method, easily understood, giving a quick but rough idea of whether the variables are related.
2. Not influenced by extreme items or values (mathematical methods are influenced by extreme items).
3. A natural first step in finding out whether two variables have a relationship.

Limitations:
1. The exact degree of correlation cannot be known; it gives only an idea of the direction and of whether the magnitude of correlation is low or high.

2. Graphic method: The individual values of the two variables X and Y are plotted on graph paper, giving two lines or curves, one for X and the other for Y. By inspecting the direction and closeness of the two lines, one infers whether the variables are related. If both lines move in the same direction (either upward or downward) the correlation is positive; if they move in opposite directions, the correlation is negative. This method is used when data are given over a period of time, i.e. a time series, e.g. income and expenditure over several years.

3. Coefficient of correlation: Karl Pearson's coefficient of correlation (denoted by r) is the most widely used mathematical method for measuring correlation. It describes the degree of relationship between two series/variables and is based on the assumption that the population is normally distributed. The formula for computing r is:

   r = Σxy / (N σx σy)

where
   x = (X − X̄); y = (Y − Ȳ)
   σx = standard deviation of the X series
   σy = standard deviation of the Y series
   N = number of pairs of observations
   r = correlation coefficient (product-moment correlation)

This formula applies only where deviations of items are taken from the actual mean, not from an assumed mean.

Range of the coefficient of correlation: the value always lies between -1 and +1.
   r = +1 means perfect positive correlation between the variables
   r = -1 means perfect negative correlation between the variables
   r = 0 means no relationship between the variables

The correlation coefficient gives both the magnitude and the direction of correlation. It is a measure of covariance between the two series: the covariance of x and y is Σxy/N, so the formula for r can be rewritten in an easier form. Since

   σx = √(Σx²/N) and σy = √(Σy²/N),

   r = Σxy / (N σx σy) = Σxy / √(Σx² Σy²)

where x = (X − X̄) and y = (Y − Ȳ).

Procedure for calculating the correlation coefficient: It involves the following steps:
1. Take deviations of the X series from the mean of X (X̄) and denote them by x.
2. Square the deviations (x²) and take their total, Σx².
3. Take deviations of the Y series from the mean of Y (Ȳ) and denote them by y.
4. Square the deviations (y²) and take their total, Σy².
5. Multiply the paired deviations of the X and Y series (x·y) and obtain the total, Σxy.
6. Substitute the values of Σxy, Σx² and Σy² in the formula above.

Example: the marks of 5 students in two subjects are given. Calculate the correlation coefficient r.

   Student   Marks in AGB   Marks in VPY
   1         60             65
   2         70             76
   3         68             74
   4         55             62
   5         75             80
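The six steps above can be sketched in Python for the marks table (a worked answer of my own; the notes state the data but not the result):

```python
# Deviation (actual-mean) method applied to the marks example:
# AGB marks = X series, VPY marks = Y series.
X = [60, 70, 68, 55, 75]   # marks in AGB
Y = [65, 76, 74, 62, 80]   # marks in VPY

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n
x = [xi - mean_x for xi in X]   # step 1: deviations x = X - X̄
y = [yi - mean_y for yi in Y]   # step 3: deviations y = Y - Ȳ

sum_x2 = sum(a * a for a in x)              # step 2: Σx²
sum_y2 = sum(b * b for b in y)              # step 4: Σy²
sum_xy = sum(a * b for a, b in zip(x, y))   # step 5: Σxy

r = sum_xy / (sum_x2 * sum_y2) ** 0.5       # step 6: r = Σxy / √(Σx²·Σy²)
print(round(r, 3))  # → 0.996: a very strong positive correlation
```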

Direct method for calculating the correlation coefficient: r is calculated directly from the actual X and Y values, rather than from deviations of items from the actual or assumed mean:

   r = [NΣXY − (ΣX)(ΣY)] / [√(NΣX² − (ΣX)²) · √(NΣY² − (ΣY)²)]
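The direct formula can be sketched as a small function and checked against the marks example; it must give the same r as the deviation method (the function name and check are mine, not the notes'):

```python
# Direct (raw-value) formula for Pearson's r:
# r = [NΣXY - ΣXΣY] / [√(NΣX² - (ΣX)²) · √(NΣY² - (ΣY)²)]
def pearson_direct(X, Y):
    n = len(X)
    sx, sy = sum(X), sum(Y)                  # ΣX, ΣY
    sxy = sum(a * b for a, b in zip(X, Y))   # ΣXY
    sx2 = sum(a * a for a in X)              # ΣX²
    sy2 = sum(b * b for b in Y)              # ΣY²
    num = n * sxy - sx * sy
    den = ((n * sx2 - sx ** 2) ** 0.5) * ((n * sy2 - sy ** 2) ** 0.5)
    return num / den

r = pearson_direct([60, 70, 68, 55, 75], [65, 76, 74, 62, 80])
print(round(r, 3))  # → 0.996, identical to the deviation method
```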

Assumptions of Pearson's correlation coefficient: r is based on the following assumptions:
1. There is a linear relationship between the variables (when plotted on a scatter diagram, the points form a straight line).
2. The two variables are affected by a large number of independent causes, so that they form a normal distribution (e.g. variables like age, height, weight, price, demand and supply).
3. There is a cause-and-effect relationship between the forces affecting the distribution of items in the two series.

Merits of the correlation coefficient:
1. The most popular mathematical method for measuring the degree of relationship.
2. Measures both the degree and the direction of the relationship in a single figure (positive or negative).

Limitations:
1. r always assumes a linear relationship between the variables, whether or not this is correct.
2. The value of r must be interpreted carefully.
3. Its value is affected by extreme items.
4. It takes more time to compute than the graphical methods.

Interpreting the correlation coefficient: r measures the degree of relationship between two sets of figures/variables:
   r = +1 means a perfect positive relationship,
   r = -1 means a perfect negative relationship,
   r = 0 means no relationship.

The closer r is to +1 or -1, the closer the relationship between the variables; the closer r is to 0, the weaker the relationship. The closeness of the relationship is not proportional to r: r = 0.8 does not indicate a relationship twice as close as r = 0.4; it is in fact much closer. The higher the value of r, the better the estimate.

Properties of the correlation coefficient:
1. It lies between -1 and +1 (i.e. -1 ≤ r ≤ +1).
2. r is independent of change of scale (multiplying or dividing every value by a constant) and change of origin (subtracting a constant from every value) in the variables X and Y.
3. r is the geometric mean of the two regression coefficients:

   r = √(bxy · byx)

4. The degree of relationship between the two variables is symmetrical:

   rxy = Σxy / (N σx σy) = Σyx / (N σy σx) = ryx
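Property 2 (independence of change of origin and scale) is easy to verify numerically; a sketch using illustrative data of my own:

```python
# r is unchanged when every X is shifted/scaled and every Y is shifted/scaled
# (change of origin = subtract a constant; change of scale = multiply by one).
def pearson_r(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(X, Y))
    den = (sum((a - mx) ** 2 for a in X) *
           sum((b - my) ** 2 for b in Y)) ** 0.5
    return num / den

X = [60, 70, 68, 55, 75]
Y = [65, 76, 74, 62, 80]
r1 = pearson_r(X, Y)
r2 = pearson_r([(x - 50) / 5 for x in X],   # shift origin by 50, scale by 1/5
               [(y - 60) * 2 for y in Y])   # shift origin by 60, scale by 2
print(abs(r1 - r2) < 1e-9)  # → True: same r after shifting and rescaling
```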

Probable error of the coefficient of correlation: The probable error of r determines the reliability of the value of the coefficient insofar as it depends on the conditions of random sampling. It is estimated as:

   PEr = 0.6745 × (1 − r²) / √N

where N = number of pairs of observations and r = coefficient of correlation.

If the value of r is less than PEr, the value is not at all significant: there is no evidence of correlation. If r is more than six times PEr, the coefficient of correlation is significant. By adding and subtracting PEr from r, the upper and lower limits within which r is expected to lie in the population can be obtained:

   ρ = r ± PEr

where ρ (rho) denotes the correlation in the population.

Example: compute PEr for r = 0.80 with N = 16 (sample pairs of items):

   PEr = 0.6745 × (1 − 0.80²) / √16 = 0.6745 × 0.36 / 4 ≈ 0.06

The limits of the correlation in the population are r ± PEr = 0.80 ± 0.06, i.e. 0.74 to 0.86.

If 0.6745 is omitted from the formula for PEr, we get the standard error of r:

   SEr = (1 − r²) / √N

Conditions for the use of PE: PE can be used properly when the following conditions exist:
1. The data approximate a normal frequency curve (bell-shaped curve).
2. The statistical measure for which PE is estimated has been calculated from a sample.
3. The sample is unbiased and the individual items are independent.

Example: if r = 0.6 and N = 64, find PEr and the limits of the population r.

   PEr = 0.6745 × (1 − 0.6²) / √64 = 0.6745 × 0.64 / 8 ≈ 0.054

   Limits of the population correlation = 0.60 ± 0.054 = 0.546 to 0.654.

Coefficient of determination: The square of the coefficient of correlation, r², is called the coefficient of determination. If r = 0.8 then r² = 0.64, meaning that 64% of the variation in the dependent variable is explained by the independent variable. The maximum value of r² is 1, when all the variation in the dependent variable (Y) is explained by the independent variable (X).

   Coefficient of determination: r² = explained variation / total variation

Coefficient of non-determination (k²): the ratio of unexplained variation to total variation:

   k² = 1 − r² = unexplained variation / total variation
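The probable-error calculation can be sketched as a small function (my own helper name) and checked against the second worked example, r = 0.6 with N = 64:

```python
# Probable error of r: PEr = 0.6745 * (1 - r²) / √N,
# with population limits r ± PEr.
import math

def probable_error(r, n):
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

pe = probable_error(0.6, 64)                    # worked example from the notes
print(round(pe, 3))                             # → 0.054
print(round(0.6 - pe, 3), round(0.6 + pe, 3))   # → 0.546 0.654 (limits)
```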

Rank correlation coefficient: A method of finding co-variability between two variables when the population is not normally distributed was given by Charles Edward Spearman (1904) and is called rank correlation. Rank correlation ranks the observations according to size and makes the calculations on the ranks rather than on the original values. Using ranks rather than actual values gives the coefficient of rank correlation, R:

   R = 1 − 6ΣD² / (N(N² − 1))  or equivalently  R = 1 − 6ΣD² / (N³ − N)

where
   R = rank correlation coefficient
   D = difference of ranks between paired items in the two series (R1 − R2)

The calculation is laid out in columns R1, R2, D (= R1 − R2) and D², with totals ΣD (which equals 0) and ΣD².

The value of R is interpreted in the same way as r. It ranges from -1 to +1: when R = +1 there is complete agreement in the order of ranks and the ranks are in the same direction; when R = -1 the ranks are in complete agreement but in opposite directions.

Characteristics of Spearman's rank correlation:
1. The sum of the differences of ranks between the two variables is zero (ΣD = 0).
2. It is non-parametric (distribution-free), making no assumption about the population from which the observations are drawn.
3. It is simply the correlation coefficient between ranks and is interpreted in the same manner.

Calculation of Spearman's rank correlation:
(A) When ranks are given, the steps involved are:
1. Take the difference of the two ranks, D = R1 − R2.
2. Square the differences and obtain the total ΣD².
3. Apply the formula R = 1 − 6ΣD² / (N(N² − 1)).
(B) When ranks are not given: when actual data are given but not ranks, first assign ranks, taking the highest value as rank 1 and the lowest as the last rank (ranking can also start from the lowest value as rank 1, but the same pattern must be followed for both variables), then calculate as in the first method.

Merits of rank correlation:
1. Simpler to understand and easier to calculate than Karl Pearson's correlation coefficient; the answers are the same if all the items are different.
2. It can be used with qualitative data such as honesty, efficiency or intelligence (e.g. the workers of two offices can be ranked for efficiency).
3. It is the only method available when ranks are given but the actual data are not.
4. It can also be used when the actual data are available.
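The steps for case (B) can be sketched as follows, reusing the marks data from the Pearson example (the helper functions are mine; the rank assignment assumes no tied values):

```python
# Spearman's rank correlation R = 1 - 6ΣD² / (N(N² - 1)),
# assigning rank 1 to the highest value in each series.
def spearman_r(X, Y):
    def ranks(values):
        order = sorted(values, reverse=True)          # highest value → rank 1
        return [order.index(v) + 1 for v in values]   # assumes no ties
    rx, ry = ranks(X), ranks(Y)
    n = len(X)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))    # ΣD²
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

X = [60, 70, 68, 55, 75]   # marks in AGB
Y = [65, 76, 74, 62, 80]   # marks in VPY
print(spearman_r(X, Y))    # → 1.0: the rank orders agree completely
```

Here every student holds the same rank in both subjects, so ΣD² = 0 and R = +1 even though the Pearson r on the raw marks is slightly below 1.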

Limitations of rank correlation:
1. It cannot be used for finding correlation in a grouped frequency distribution.
2. When the number of items exceeds 30, the calculations become tedious; it should therefore not be applied when N > 30, unless only ranks are given and not the actual data.

Applications of rank correlation:
1. The initial data are in the form of ranks.
2. When N is small (...

