Pr2s - Prof John Kent
Course: Multivariate Analysis
Institution: University of Leeds

MATH 3772 Solution to Practical

The task was to investigate the fox data for the US and the UK to see whether there is any statistical difference between them. The R code at the end gives more detailed calculations and results to augment the highlights below.

Exploratory Analysis

Various plots can be used to examine the distribution of the data, both within and between the two countries. Histograms, boxplots, and QQ plots show the univariate distribution within each country. Figure 1 gives a histogram for each of the three variables and each of the two countries.


Figure 1: Histograms for the variables F1, F2, F3 for the US foxes (top row) and the UK foxes (bottom row)

But in some ways a more useful plot is the pairs plot, which gives a picture of both the univariate and bivariate distributions. Figure 2 shows the pairs plot for the combined dataset (i.e. US plus UK), with an extra column added to label the country (1 = US, 2 = UK), "jittered" to distinguish ties, and with different plotting characters (and colors when



Figure 2: Pairs plot for the joint fox data set: US (label = 1, plotted as circles) and UK (label = 2, plotted as triangles).

viewed on the computer) used for the two countries. The main conclusions are:

(a) Univariate normality. From the histograms and from the final row/column of Figure 2, we conclude that all the histograms seem compatible with normality. [Even though not all the histograms are exactly "bell-shaped", bear in mind that the sample sizes are not very large.]

(b) Bivariate normality. Similarly, all the bivariate plots in Figure 2 suggest that the data within each country are compatible with bivariate normality. For each pair of variables and each country, the plots suggest an elliptical pattern of concentration.

(c) Univariate comparisons between countries (from the final row/column of Figure 2).
(i) For F1 (height), the spreads (e.g. standard deviation or range) are similar for the two countries and the means are similar. Indeed the two countries seem indistinguishable.
(ii) For F2 (length), the spreads are similar for the two countries and the US mean is higher than the UK mean. There is some overlap between the two distributions, but the means look significantly different.
(iii) For F3 (weight), the spreads are similar for the two countries and the US mean is lower than the UK mean. There is more overlap between the two distributions than for F2, but the means still look significantly different.

(d) Bivariate comparisons between countries (backed up by numerical values for the correlations in the R output).
(i) For F1 vs. F2, there appears to be a small positive correlation between the two variables within each country. The two clusters of points are well separated, mainly due to F2.
(ii) For F1 vs. F3, there seems to be a positive correlation between the two variables within each country, though the correlation looks considerably stronger for the US than for the UK. The two groups (US and UK) can be reasonably well separated by a suitable 45° line.
(iii) For F2 vs. F3, there again seems to be a positive correlation between the two variables within each country; this time the correlation looks stronger for the UK foxes. In this case the two countries can be separated nearly perfectly by a suitable 45° line.
(iv) However, given the small sample sizes, the differences between the correlations for the two countries do not seem very important. In particular, the assumption of a common covariance matrix for the two countries seems reasonable, at least as an approximation.

T² test

The two-sample T² statistic is given by

    T² = (n₁ n₂ / (n₁ + n₂)) dᵀ S⁻¹ d = 88.45,

where d = x̄₁ − x̄₂ is the 3-dimensional difference between the mean vectors of the two countries and

    S = {(n₁ − 1) S₁ + (n₂ − 1) S₂} / (n₁ + n₂ − 2)

is the pooled estimate of the covariance matrix. Under the null hypothesis of no difference between the means of the two countries, T² ∼ T²(p, n₁ + n₂ − 2) = T²(3, 33), with upper 5% critical value T²(3, 33; 95%) = 9.30. Since 88.45 ≫ 9.30 (or, in terms of the corresponding F(3, 31) statistic and critical value, 27.70 ≫ 2.91), we strongly reject the null hypothesis at the 5% level. (Alternatively, we can calculate the p-value, which is extremely significant at 6.6 × 10⁻⁹.)

To try to understand why we have rejected the null hypothesis, we look at the simultaneous confidence intervals, which for a linear combination a take the form

    aᵀ d ± sqrt{ ((n₁ + n₂)/(n₁ n₂)) T²(3, 33; 95%) aᵀ S a }.

For the simple linear combinations a = (1, 0, 0)ᵀ, (0, 1, 0)ᵀ, (0, 0, 1)ᵀ, the simultaneous 95% SCIs are (−2.55, 3.36), (3.73, 13.00), (−0.895, −0.112), respectively. Note that the first interval includes 0, whereas the latter two exclude 0. That is, the difference between the two countries is due to variables F2 (for which the US has a higher mean than the UK) and F3 (for which the US has a lower mean than the UK).
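The relationship between the T² and F scales used above can be checked numerically. Below is a small sketch in Python with scipy (the practical's own code, later in this document, is in R); the figures 88.45, 27.70, 9.30 and 2.91 are taken from the text.

```python
from scipy.stats import f

p, m = 3, 33           # p variables; m = n1 + n2 - 2 degrees of freedom
T2 = 88.44891          # observed two-sample T^2 statistic (from the text)

# T^2(p, m) is a scaled F(p, m - p + 1) distribution:
#   F = T^2 * (m - p + 1) / (m * p)
F_stat = T2 * (m - p + 1) / (m * p)
print(F_stat)                       # about 27.70, the F(3, 31) statistic

# upper 5% critical values on both scales
F_crit = f.ppf(0.95, p, m - p + 1)  # about 2.91
T2_crit = (m * p / (m - p + 1)) * F_crit  # about 9.30
print(F_crit, T2_crit)

# p-value of the observed statistic, about 6.6e-09
print(f.sf(F_stat, p, m - p + 1))
```

This mirrors the `pT2`/`qT2` helper functions defined in the R code at the end of the document.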

Re-scaling

If A is a nonsingular 3 × 3 matrix and we linearly transform the data from x to Ax (so that the pooled sample covariance matrix is transformed from S to A S Aᵀ), then the T² statistic remains unchanged, since

    (Ad)ᵀ (A S Aᵀ)⁻¹ (Ad) = dᵀ Aᵀ A⁻ᵀ S⁻¹ A⁻¹ A d = dᵀ S⁻¹ d,

where A⁻ᵀ is shorthand for (Aᵀ)⁻¹ = (A⁻¹)ᵀ. Measuring F3 in pounds instead of kg corresponds to A = diag(1, 1, 2.2) (there are 2.2 pounds in one kilogram). Hence the T² statistic is unchanged under changes to the units of measurement.
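This invariance is easy to verify numerically. The sketch below uses a made-up difference vector d and covariance matrix S (illustrative values only, not the fox data) and compares T² before and after the pounds-for-kilograms rescaling A = diag(1, 1, 2.2).

```python
import numpy as np

n1, n2 = 15, 20                  # the fox sample sizes
c = n1 * n2 / (n1 + n2)

# illustrative (made-up) difference vector and pooled covariance matrix
d = np.array([0.4, 8.4, -0.5])
S = np.array([[2.0, 0.5, 0.3],
              [0.5, 3.0, 0.4],
              [0.3, 0.4, 1.0]])

A = np.diag([1.0, 1.0, 2.2])     # measure F3 in pounds instead of kg

T2_orig = c * d @ np.linalg.solve(S, d)
T2_resc = c * (A @ d) @ np.linalg.solve(A @ S @ A.T, A @ d)

print(np.isclose(T2_orig, T2_resc))   # True: T^2 is unchanged
```

Any nonsingular A would give the same result, by the matrix identity in the text.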

Known variable of interest

If it is known (or suspected) ahead of time that variable F1 is where the differences between the countries are expected, we can use the ordinary univariate t-test instead of the multivariate T² test. If there really is a difference between the countries on F1 (and little or no difference on F2 and F3), then the univariate t-test will be more powerful than the multivariate T² test, because the corresponding confidence interval for the mean of F1 will be shorter. The t-statistic takes the value

    t = sqrt{n₁ n₂ / (n₁ + n₂)} d₁ / √s₁₁ = 0.418,    t² = 0.175.

The two-sided upper 5% critical value of the t₃₃ distribution is 2.03, so we do not reject the null hypothesis in this case. That is, the ecologist seems to have been misguided in his preconceptions.

It is also of interest to compare this univariate test (based on t²) to the multivariate test above (based on T²).
(a) The sample value t² = 0.175 must be smaller than T² = 88.45 because we have not maximized over the choice of linear combination a, but have instead used a pre-specified choice a = (1, 0, 0)ᵀ.
(b) The critical value {t(33; 97.5%)}² = 2.03² = 4.14 = F(1, 33; 95%) is considerably smaller than the critical value of T² (namely T²(3, 33; 95%) = 9.30) because we do not need to allow for maximizing over a, or equivalently, for multiple comparisons.
(c) The difference in critical values means that the 95% confidence interval for the mean difference between the US and UK on F1 is much narrower when based on the t-distribution, (−1.57, 2.38), than when constructed as a simultaneous interval based on Hotelling's T² distribution, (−2.55, 3.36). But, as we have already observed, since even the narrower interval includes 0, the null hypothesis is still not rejected even if we knew F1 was the variable where the difference was expected to lie.
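The critical-value comparison in (b) can be reproduced directly. A short sketch in Python with scipy; note that the squared t(33; 97.5%) critical value equals the F(1, 33; 95%) critical value exactly, while the Hotelling critical value for p = 3 is considerably larger.

```python
from scipy.stats import t, f

m = 33                           # n1 + n2 - 2 degrees of freedom
p = 3

t_crit = t.ppf(0.975, m)         # two-sided 5% point of t_33, about 2.03
t2_crit = t_crit ** 2            # about 4.14
F1_crit = f.ppf(0.95, 1, m)      # identical to t2_crit

# Hotelling critical value on the T^2 scale for p = 3, about 9.30
T2_crit = (m * p / (m - p + 1)) * f.ppf(0.95, p, m - p + 1)

print(t2_crit, F1_crit, T2_crit)
```

The ratio of interval half-widths is sqrt(T2_crit / t2_crit), which is why the simultaneous interval for F1 is noticeably wider than the single t interval.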

Difference between T² and t²

Recall from the notes that for a linear combination a, the t² statistic based on the linear combination aᵀx is given by

    t²_a = (n₁ n₂ / (n₁ + n₂)) {aᵀ(x̄₁ − x̄₂)}² / (aᵀ S a),

and T² is the maximum of t²_a as a varies over all non-zero linear combinations a. In particular, T² = 88.45 must be greater than or equal to the t² statistics for each of the coordinate variables, i.e. a = (1, 0, 0)ᵀ, a = (0, 1, 0)ᵀ, and a = (0, 0, 1)ᵀ. For the fox data set T² = 88.45 and the three t² statistics are 0.175, 30.32, 15.39.

However, numerically T² is nearly 3 times the largest of the t² statistics, which seems unusually large. To understand why, it is helpful to look at the bivariate plot between F2 and F3 in Figure 2. The difference between the bivariate means lies roughly on a line of slope −45°, whereas the two variables are positively correlated (so the major axis lies on a line with slope +45°). Further, the two countries can be nearly perfectly separated in this bivariate plot, whereas in the two one-dimensional plots there is appreciable overlap between them. In other words, the two countries look considerably more different from one another in this two-dimensional plot than in either one-dimensional plot. The leaf data (Example 6.2 in Section 6.3.3 of the notes) showed similar behavior. A similar, but less extreme, story can be told from the bivariate plot of F1 vs. F3. The combination of effects from F1 vs. F3 and F2 vs. F3 seems to explain why T² is so large. However, the fact that T² is nearly 3 times the largest of the t² statistics is not related to the fact that the dimension of the problem is p = 3.

R code

# function to express quantiles and p-values for T2 in terms of F
pT2 = function(quant, p, m) pf(quant*(m-p+1)/(m*p), p, m-p+1)
qT2 = function(prob, p, m) (m*p/(m-p+1))*qf(prob, p, m-p+1)

# function to compute simultaneous confidence intervals for
# the rows of the matrix A
sci = function(dif, S, n1, n2, A=diag(nrow(dif)), alpha=0.95, simultaneous=TRUE) {
  # two-sample SCIs based on rows of A
  if (is.matrix(A) & ncol(A) == 1) A = t(A)
  if (is.vector(A)) A = matrix(A, nrow=1)
  k = nrow(A); p = ncol(A); out = matrix(0, nrow=k, ncol=4)
  colnames(out) = c("mid", "p/m", "lo", "hi")
  for (i in 1:k) {
    a = A[i,]; out[i,1] = sum(dif*a)
    vr = t(a) %*% S %*% a; sd = sqrt(vr)
    if (simultaneous == TRUE)  crit = qT2(alpha, p, n1+n2-2)
    if (simultaneous == FALSE) crit = qT2(alpha, 1, n1+n2-2)
    out[i,2] = sqrt(vr*crit*((n1+n2)/(n1*n2)))
    out[i,3] = out[i,1] - out[i,2]; out[i,4] = out[i,1] + out[i,2]
  }
  out
}

xus = read.table("http://www1.maths.leeds.ac.uk/~john/3772/USFox.txt", header=TRUE)
xuk = read.table("http://www1.maths.leeds.ac.uk/~john/3772/BritishFox.txt", header=TRUE)
xus = as.matrix(xus); xuk = as.matrix(xuk)

# histogram plots
par(mfcol=c(2,3))
for (i in 1:3) {
  hist(xus[,i], main=paste("F", i, sep=""), xlab="")
  hist(xuk[,i], main=paste("F", i, sep=""), xlab="")
}

nus = nrow(xus); nuk = nrow(xuk)
n1 = nus; n2 = nuk
m = n1+n2-2; p = 3
cat("sample sizes: nus=n1=", n1, " nuk=n2=", n2, "\n")
species = c(rep(1, nus), rep(2, nuk))
species.jit = species + .01*rnorm(nus+nuk)  # jittered species
x = cbind(rbind(xus, xuk))
xplus = cbind(x, species.jit)

# pairs plot
par(mfrow=c(1,1)); pairs(xplus, col=species, pch=species)

x1bar = t(xus) %*% rep(1/n1, n1); x2bar = t(xuk) %*% rep(1/n2, n2)
dif = x1bar - x2bar
mean.summary = cbind(x1bar, x2bar, dif)
colnames(mean.summary) = c("US", "UK", "difference")
cat("sample means and their difference in 3 columns\n")
print(mean.summary)

X1c = xus - rep(1, n1) %*% t(x1bar); X2c = xuk - rep(1, n2) %*% t(x2bar)
V1 = t(X1c) %*% X1c; V2 = t(X2c) %*% X2c; S = (V1+V2)/(n1+n2-2)
S1 = V1/(n1-1); S2 = V2/(n2-1)
R1 = cor(xus); R2 = cor(xuk)  # corr matrices
cat("correlation matrix from US\n"); print(R1)
cat("correlation matrix from UK\n"); print(R2)

T2 = ((n1*n2)/(n1+n2)) * t(dif) %*% solve(S, dif)
cat("T2=", T2, " F=", T2*(m-p+1)/(m*p), "\n")
T2.95 = qT2(0.95, p, n1+n2-2); F.95 = qf(0.95, p, m-p+1)
cat("T2.95=", T2.95, " F.95=", F.95, "\n")

cat("p-value for T2 = ", 1-pT2(T2, p, n1+n2-2), "\n")
cat("p,m,m-p+1=", c(p, m, m-p+1), "\n")

# t-tests
tt = rep(0,3); t2 = rep(0,3); ttpval = rep(0,3)
for (i in 1:3) {
  tt[i] = sqrt((n1*n2)/(n1+n2))*dif[i]/sqrt(S[i,i]); t2[i] = tt[i]^2
  ttpval[i] = 2*(1-pt(abs(tt[i]), n1+n2-2))
}
cat("t-stats, squared t-stats, and two-sided p-vals\n")
print(cbind(tt, t2, ttpval))
tt.95 = qt(.975, n1+n2-2); t2.95 = tt.95^2
cat("2-sided 5% critical t value = ", tt.95,
    " squared crit value = ", t2.95, "\n")

out.sci = sci(dif, S, n1, n2, A=diag(p), alpha=0.95, simultaneous=TRUE)
cat("SCIs\n"); print(out.sci)
out.ci = sci(dif, S, n1, n2, A=diag(p), alpha=0.95, simultaneous=FALSE)
cat("single CIs\n"); print(out.ci)

Selected R output

sample sizes: nus=n1= 15  nuk=n2= 20
sample means and their difference in 3 columns
          US     UK difference
F1 42.400000 41.995  0.4050000
F2 71.600000 63.235  8.3650000
F3  8.066667  8.570 -0.5033333
correlation matrix from US
          F1        F2        F3
F1 1.0000000 0.2247467 0.6997753
F2 0.2247467 1.0000000 0.2715030
F3 0.6997753 0.2715030 1.0000000
correlation matrix from UK
          F1        F2        F3
F1 1.0000000 0.4574467 0.1747894
F2 0.4574467 1.0000000 0.6244950
F3 0.1747894 0.6244950 1.0000000
T2= 88.44891  F= 27.69612
T2.95= 9.297486  F.95= 2.911334
p-value for T2 =  6.648963e-09
p,m,m-p+1= 3 33 31
t-stats, squared t-stats, and two-sided p-vals
             tt         t2       ttpval
[1,]  0.4181947  0.1748868 6.785137e-01
[2,]  5.5066089 30.3227413 4.134267e-06
[3,] -3.9234116 15.3931589 4.179428e-04
2-sided 5% critical t value =  2.034515  squared crit value =  4.139252
SCIs
            mid       p/m         lo         hi
[1,]  0.4050000 2.9529713 -2.5479713  3.3579713
[2,]  8.3650000 4.6319566  3.7330434 12.9969566
[3,] -0.5033333 0.3911781 -0.8945115 -0.1121552
single CIs
            mid       p/m         lo        hi
[1,]  0.4050000 1.9703229 -1.5653229  2.375323
[2,]  8.3650000 3.0905991  5.2744009 11.455599
[3,] -0.5033333 0.2610074 -0.7643407 -0.242326

