MH3511 Finals Cheatsheet PDF

Title	MH3511 Finals Cheatsheet
Course	Data Analysis w Comp
Institution	Nanyang Technological University
Pages	2
File Size	334.1 KB
File Type	PDF



Description

MH3511 Cheatsheet AY20/21

Chap1 – Intro to R
> seq(2,10,2); seq(2,10,length=3) #[1] 2 4 6 8 10; [1] 2 6 10
> rep(2,3) #[1] 2 2 2
Vectors
> matrix(data,byrow = F, nrow=value)
> Matrix multiplication: %*%
> Add row/column: rbind()/cbind()
> Remove row/column: matrix_name[-r,-c]
> Transpose: t(); Inverse: solve(); Determinant: det()
Data Frames
> x=data.frame(data1,data2,..)
> Column names: names(x)
> Row names: row.names(x)
> subset(x, condition1|condition2…) #subset
> x[order(x$variable,decreasing=F),] #sort rows by a variable
> AND: "&", OR: "|", NOT: "!"
Conditional Statements
> if(condition){expr} else if(condition){expr} else{expr}
> while(condition){expr1;expr2…} #expr can be an if statement
> for(var in seq){expr} #var in seq: (elem in x), (i in 1:10)
Functions
> fun = function(var1,var2..){body}

Chap2 – Describing Numerical Data
Trimmed Mean: Removes a proportion of the largest and smallest observations and calculates the mean of the remaining observations.
> mean(data,trim=0.10) #10% from each end
1. Establish a relatively low SE of the data
2. Remove effects of possible outliers
Summary: summary(data);
> quantile(data,0.25) #25th percentile (first quartile)
Graphical Methods
> Stem and Leaf: stem(data)
> Histogram: hist(data,breaks=n)
Imposing Normal PDF on histogram
> xpt=seq(a,b,by=0.1) #(a,b) dependent on range
> n_den=dnorm(xpt,mean(data),sd(data))
> ypt=n_den*length(data)*w #w is the width of each bin
> lines(xpt,ypt,col="blue")
QQplot
1. Compare data with known distribution
2. Compare 2 samples
3. Compare 2 known distributions
> qqnorm(data)
> qqline(data,col="blue")
Normally distributed data will lie close to the qqline
Long/short left/right tails can be identified
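A minimal sketch of the trimmed mean from Chap2, assuming a made-up sample `x` with one extreme value. It compares the plain mean, `mean(..., trim=0.10)`, and the same trimming done by hand; the data and variable names are illustrative only.

```r
# Hypothetical sample with one extreme value (50).
x <- c(2, 4, 5, 5, 6, 7, 7, 8, 9, 50)

plain_mean   <- mean(x)               # uses all 10 values
trimmed_mean <- mean(x, trim = 0.10)  # drops floor(10 * 0.10) = 1 value from each end

# Same calculation by hand: sort, drop k values at each end, average the rest.
k <- floor(length(x) * 0.10)
by_hand <- mean(sort(x)[(k + 1):(length(x) - k)])

print(c(plain = plain_mean, trimmed = trimmed_mean, by_hand = by_hand))
```

The trimmed value sits much closer to the bulk of the data, which is the "remove effects of possible outliers" point above.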

Shapiro-Wilk Test
> shapiro.test(data)
H0: Data is normally distributed
H1: Data is not normally distributed
Boxplot

Chap4 – Categorical Data
Binomial Dist X~B(n,p)
Find $p = \bar{X}/n$
> b_pdf=dbinom(x,n,p)
> exp_freq=N*b_pdf #N: total sample size
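A rough worked example of the binomial fit above, assuming hypothetical observed counts for X = 0..4 successes in n = 4 trials. It estimates p from the sample mean and then builds the expected frequencies; `obs_freq`, `x_vals` and `p_hat` are names chosen for illustration.

```r
# Hypothetical data: number of successes in n = 4 trials, recorded for N = 50 samples.
n <- 4
x_vals   <- 0:n
obs_freq <- c(6, 15, 17, 9, 3)        # observed counts for x = 0, 1, 2, 3, 4
N <- sum(obs_freq)                    # total sample size

x_bar <- sum(x_vals * obs_freq) / N   # sample mean of the counts
p_hat <- x_bar / n                    # estimate of p, since E[X] = n * p

b_pdf    <- dbinom(x_vals, n, p_hat)  # fitted binomial probabilities
exp_freq <- N * b_pdf                 # expected frequencies under B(n, p_hat)

print(round(cbind(x = x_vals, observed = obs_freq, expected = exp_freq), 2))
```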

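F-test for $\sigma_1^2 = \sigma_2^2$ against $\sigma_1^2 \neq \sigma_2^2$
$F^* = S_1^2 / S_2^2 \sim F(n_1-1, n_2-1)$
H0: $\sigma_1^2 = \sigma_2^2$, H1: $\sigma_1^2 \neq \sigma_2^2$
Reject H0 if the p-value reported by var.test is below the significance level:
> var.test(data1,data2)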
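The F-test above can be reproduced by hand as a check. This is a sketch on made-up samples; the two-sided p-value is taken as twice the smaller tail of F(n1-1, n2-1), which is an assumption about the convention that happens to match what `var.test` reports.

```r
# Hypothetical samples.
data1 <- c(5.1, 4.8, 6.2, 5.9, 5.5, 6.8, 4.3, 5.0)
data2 <- c(5.4, 5.6, 5.2, 5.8, 5.5, 5.3, 5.7)

f_star <- var(data1) / var(data2)   # F* = S1^2 / S2^2
v1 <- length(data1) - 1             # numerator degrees of freedom
v2 <- length(data2) - 1             # denominator degrees of freedom

# Two-sided p-value: twice the smaller tail probability.
lower_tail <- pf(f_star, v1, v2)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

res <- var.test(data1, data2)
print(c(by_hand = f_star, var_test = unname(res$statistic)))
print(c(by_hand = p_value, var_test = res$p.value))
```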

2 Way Contingency Tables
E[i,j] = (Total row i / Total) * Column total j

> boxplot(data)
Outlier
1. $|x_i - \bar{x}| > 2 \times sd$
2. $x_i < q_1 - 1.5\,IQR$ or $x_i > q_3 + 1.5\,IQR$
R code:
> abs(data-mean(data))>2*sd(data) or
> (data<quantile(data,0.25)-1.5*IQR(data)) | (data>quantile(data,0.75)+1.5*IQR(data))

Chap3 – Statistical Inference
Confidence interval (σ known): $\bar{X} - z_{a/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{a/2}\,\frac{\sigma}{\sqrt{n}}$
Z0.05 = 1.645, Z0.025 = 1.96, Z0.01 = 2.326
Confidence interval (σ unknown): $\bar{X} - t_{a/2,n-1}\,\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{a/2,n-1}\,\frac{s}{\sqrt{n}}$
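A short sketch of both confidence intervals, assuming a made-up sample and, for the z-interval, a hypothetical known population standard deviation `sigma`. The t-interval is checked against `t.test`.

```r
# Hypothetical sample.
x <- c(12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 12.7)
n <- length(x)
a <- 0.05                                   # significance level, so 95% intervals

# z-interval: sigma is ASSUMED known here (illustrative value).
sigma <- 0.3
z <- qnorm(1 - a / 2)                       # about 1.96
ci_z <- mean(x) + c(-1, 1) * z * sigma / sqrt(n)

# t-interval: sigma unknown, use the sample sd and t with n - 1 df.
t_crit <- qt(1 - a / 2, df = n - 1)
ci_t <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)

print(ci_z)
print(ci_t)
print(t.test(x, conf.level = 0.95)$conf.int)  # should agree with ci_t
```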

t-test for $\mu_a = \mu_b$ against $\mu_a \neq \mu_b$
1. Conduct an F-test to check whether the two variances differ
2. Conduct the appropriate t-test accordingly

H0: No association between row and column variables
H1: Row and column variables are associated
> chisq.test(data)
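The expected-count rule E[i,j] = (row total i / total) * column total j can be checked against `chisq.test`. A minimal sketch on a made-up 2x2 table; the labels are hypothetical and the continuity correction is switched off so the statistic matches the plain formula.

```r
# Hypothetical 2 x 2 table of counts (filled column by column).
tab <- matrix(c(20, 30, 25, 25), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

total <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / total   # E[i,j]

# Chi-square statistic without continuity correction.
stat_by_hand <- sum((tab - expected)^2 / expected)

res <- chisq.test(tab, correct = FALSE)
print(expected)
print(c(by_hand = stat_by_hand, chisq_test = unname(res$statistic)))
```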

> t.test(data1,data2,var.equal = T) #or var.equal = F, depending on the F-test result

Paired 2 Way Contingency Tables

Two Dependent Samples
Sample: {(X11,X21),(X12,X22)..}
$D_i = X_{1i} - X_{2i}$,

R code:
> t.test(data, conf.level=0.9) #mu=0 default
> t.test(data, conf.level=0.9,mu=a) #H1: true mean is not equal to a
One sided CI: $\mu < \bar{X} + z_a\,\frac{\sigma}{\sqrt{n}}$
> t.test(data, conf.level=0.9, alt="less")
alt: "two.sided" (default), "less", "greater"
Proportion

Chi-square test applied as above, v=1
Chi-square with continuity correction:
R code:
> pattable = table(data$before,data$after)
> mcnemar.test(pattable)
H0: Treatment has not affected proportion of positives
H1: Treatment has affected proportion of positives
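A sketch of what `mcnemar.test` computes for a 2x2 before/after table, on made-up counts. The statistic below, (|b - c| - 1)^2 / (b + c) on the two off-diagonal cells, is the continuity-corrected form that `mcnemar.test` uses by default; it is stated here as an assumption, since the formula itself did not survive in the notes above.

```r
# Hypothetical before/after table (filled column by column).
pattable <- matrix(c(30, 12, 5, 53), nrow = 2,
                   dimnames = list(before = c("pos", "neg"),
                                   after  = c("pos", "neg")))

b     <- pattable[1, 2]   # positive before, negative after
c_off <- pattable[2, 1]   # negative before, positive after

# Continuity-corrected chi-square statistic with v = 1.
stat <- (abs(b - c_off) - 1)^2 / (b + c_off)
p_value <- 1 - pchisq(stat, df = 1)

res <- mcnemar.test(pattable)
print(c(by_hand = stat, mcnemar = unname(res$statistic)))
print(c(by_hand = p_value, mcnemar = res$p.value))
```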

> z=qnorm(1-a/2) If H0 holds true

, where v=n-1
Paired t-test
> t.test(data$var1,data$var2,mu=0,paired = T) or
> d=data$var1 - data$var2
> t.test(d) #one sample t-test
Multiple(>2) Samples

Assumptions:
- All sample observations are independent
- Population variances are the same

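Chap5 – Multiple Samples Data
When the variances are unknown but known to be equal ($\sigma_1^2 = \sigma_2^2$),
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$, $v = n_1 + n_2 - 2$

> F = qf(a/2,v1,v2)
R code for Proportion:
> prop.test(x,n,conf.level=0.90) #default correct=T
Hypo testing for Proportion:
> prop.test(x,n,p=p0,alternative="greater") #H1: true p is greater than p0; use "less" or "two.sided" as needed

When the variances are unknown and not equal ($\sigma_1^2 \neq \sigma_2^2$), use the unpooled t-test, t.test(data1,data2,var.equal = F).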
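A minimal sketch of the pooled-variance calculation on made-up samples. The test statistic used below, the difference in means divided by sqrt(Sp^2 (1/n1 + 1/n2)), is the usual pooled two-sample t statistic and is an assumption here, since the notes only give Sp^2 and v.

```r
# Hypothetical samples.
data1 <- c(20.1, 21.4, 19.8, 22.0, 20.7, 21.1)
data2 <- c(18.9, 19.5, 20.2, 18.4, 19.9, 19.1, 18.7)
n1 <- length(data1)
n2 <- length(data2)

# Pooled variance and degrees of freedom.
sp2 <- ((n1 - 1) * var(data1) + (n2 - 1) * var(data2)) / (n1 + n2 - 2)
v <- n1 + n2 - 2

# Assumed pooled t statistic and two-sided p-value.
t_star <- (mean(data1) - mean(data2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
p_value <- 2 * pt(-abs(t_star), df = v)

res <- t.test(data1, data2, var.equal = TRUE)
print(c(by_hand = t_star, t_test = unname(res$statistic)))
print(c(by_hand = p_value, t_test = res$p.value))
```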

$F^* = S_b^2 / S_w^2$, p-value = 1-pf(F*,k-1,n-k)
H0: There is no difference between "factors" for "variable"
H1: A difference between "factors" for "variable" exists
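The F statistic above can be built by hand from the between-group and within-group sums of squares and compared with `aov`. A sketch on made-up data with k = 3 groups; `y` and `g` are illustrative names.

```r
# Hypothetical response and grouping factor (3 groups of 4).
y <- c(23, 25, 21, 27, 30, 28, 31, 29, 20, 19, 22, 18)
g <- factor(rep(c("A", "B", "C"), each = 4))
n <- length(y)
k <- nlevels(g)

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

ssb <- sum(group_sizes * (group_means - mean(y))^2)  # between-group sum of squares
ssw <- sum((y - ave(y, g))^2)                        # within-group sum of squares

f_star  <- (ssb / (k - 1)) / (ssw / (n - k))         # Sb^2 / Sw^2
p_value <- 1 - pf(f_star, k - 1, n - k)

fit <- summary(aov(y ~ g))[[1]]
print(c(by_hand = f_star, aov = fit[["F value"]][1]))
print(c(by_hand = p_value, aov = fit[["Pr(>F)"]][1]))
```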

> friedman.test(var~treatment|block, data=datafrm) or
> friedman.test(datafrm$var,datafrm$treatment,
+ datafrm$block)

ANOVA Model
> aov(data$var~factor(data$factor))
> summary(aov(data$var~factor(data$factor)))
If H0 is rejected, conduct pairwise comparisons
> pairwise.t.test(data$variable,data$factor,
+ p.adjust.method = "none") #Significant p-values denote a difference between factors

Chap6 – NonParametric Test
Quantile Test
1. Rank observations
2. Let t1 be the number of observations ≤ xp
3. Let t2 be the number of observations < xp

Chap7 – Correlation and Regression
- Correlation coefficient, $\rho = \frac{Cov(X,Y)}{\sqrt{var(X) \cdot var(Y)}}$

p-value = 1-pchisq(T,number of factors -1)
> kruskal.test(data$var,data$factor)
Two Dependent Samples

Prediction

- Hypothesis Testing (H0: ρ=0)

p-value = 2*pt(-abs(t),df) or 2*(1-pf(f,df1,df2))
> cor.test(x,y)

Where n is the number of observations, Y~B(n,p).
R code:
> 2*pbinom(t1,n,p) or 2*(1-pbinom(t2-1,n,p))
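A rough sketch of the quantile test using the binomial tail probabilities above, on a made-up sample with H0: the median (p = 0.5) equals a hypothetical value `xp`. Doubling the smaller tail and capping at 1 is an assumption about how the two expressions are combined into one two-sided p-value.

```r
# Hypothetical sample; H0: the p-th quantile equals xp (here the median).
x  <- c(3, 5, 7, 8, 9, 10, 12, 13, 15, 20)
xp <- 6
p  <- 0.5
n  <- length(x)

t1 <- sum(x <= xp)   # number of observations <= xp
t2 <- sum(x <  xp)   # number of observations <  xp

p_lower <- pbinom(t1, n, p)           # Pr(Y <= t1), Y ~ B(n, p)
p_upper <- 1 - pbinom(t2 - 1, n, p)   # Pr(Y >= t2)

# Assumption: two-sided p-value = twice the smaller tail, capped at 1.
p_value <- min(1, 2 * min(p_lower, p_upper))
print(c(t1 = t1, t2 = t2, p_value = p_value))
```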

- Hypothesis Testing (H0: ρ=ρ0)

Two Independent Samples
- T~B(n,0.5), T0 = n(+ chosen), the number of "+" outcomes
p-value = Pr(T≥T0|T~B(n,0.5))
> 1-pbinom(T0-1,n,0.5) or
> binom.test(T0,n,0.5,alternative="greater")
When n is large,
> prop.test(T0,n,p=0.5,alternative="greater")
Wilcoxon signed ranks test
Within each pair, let Di = Yi - Xi. Let Ri be the rank of |Di|, with a negative sign attached when Di is negative.
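A small sketch of the sign test and the signed ranks on made-up paired data with no ties and no zero differences. The sum of the positive signed ranks is compared with the statistic `V` reported by `wilcox.test`; the data and names are illustrative.

```r
# Hypothetical paired observations.
x <- c(10, 12, 9, 14, 11, 13)
y <- c(13, 11, 14, 16, 17, 9)
d <- y - x                              # 3 -1 5 2 6 -4

# Sign test: T0 = number of positive differences, T ~ B(n, 0.5) under H0.
n  <- sum(d != 0)
t0 <- sum(d > 0)
p_sign <- 1 - pbinom(t0 - 1, n, 0.5)    # Pr(T >= T0)
p_sign_check <- binom.test(t0, n, 0.5, alternative = "greater")$p.value

# Wilcoxon signed ranks: rank |Di|, then attach the sign of Di.
signed_ranks <- rank(abs(d)) * sign(d)
v_by_hand <- sum(signed_ranks[signed_ranks > 0])   # sum of positive ranks
v_check <- wilcox.test(y, x, paired = TRUE)$statistic

print(signed_ranks)
print(c(p_sign = p_sign, binom_test = p_sign_check))
print(c(by_hand = v_by_hand, wilcox = unname(v_check)))
```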

pvalue = 2*pnorm(T)
Multiple Independent Samples
Kruskal-Wallis Test

> wilcox.test(y,x,paired=T,correct=T) #or correct=F
Multiple Dependent Samples

^Rank all observations

, df=n-2. CI: Estimating mean response

*Measures strength and direction of linear relationship

$t^* = r / S_r$, df = n-2, or $F = \frac{1+|r|}{1-|r|}$, df = (n-2,n-2)
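A sketch of the t-test for H0: ρ = 0 on made-up data, checked against `cor.test`. The standard error Sr = sqrt((1 - r^2)/(n - 2)) is assumed here, since its definition is not given in the notes.

```r
# Hypothetical paired data.
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
y <- c(2.0, 2.9, 3.7, 5.1, 4.8, 6.9, 7.2, 8.8)
n <- length(x)

r <- cov(x, y) / sqrt(var(x) * var(y))   # sample correlation coefficient

# Assumed standard error of r under H0: rho = 0.
s_r <- sqrt((1 - r^2) / (n - 2))
t_star <- r / s_r
p_value <- 2 * pt(-abs(t_star), df = n - 2)

res <- cor.test(x, y)
print(c(r = r, cor = cor(x, y)))
print(c(by_hand = t_star, cor_test = unname(res$statistic)))
print(c(by_hand = p_value, cor_test = res$p.value))
```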

Wilcoxon Rank Sum Test
H0: Pr(X
> wilcox.test(data1,data2) or
> wilcox.test(variable~factor, data=my_data, exact=F)

, df=n-2. CI:

1. Rank across row i then sum down column j (Rj)

> pvalue = 2*(1-pnorm(abs(Z)))
Confidence Interval: $z \pm Z_{0.025} \times \sigma_z = (z_l, z_u)$
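A sketch of the interval above on made-up data, assuming that z is Fisher's transform z = atanh(r) with σz = 1/sqrt(n - 3), and that (zl, zu) is mapped back to the correlation scale with tanh. Those definitions and the null value `rho0` are assumptions; the resulting interval is compared with the one from `cor.test`.

```r
# Hypothetical paired data.
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
y <- c(2.0, 2.9, 3.7, 5.1, 4.8, 6.9, 7.2, 8.8)
n <- length(x)
r <- cor(x, y)

z       <- atanh(r)          # Fisher z = 0.5 * log((1 + r) / (1 - r))  (assumed)
sigma_z <- 1 / sqrt(n - 3)   # assumed standard deviation of z

# 95% interval on the z scale, then back-transformed to the rho scale.
z_ci   <- z + c(-1, 1) * qnorm(0.975) * sigma_z
rho_ci <- tanh(z_ci)

# Test of H0: rho = rho0 for a hypothetical rho0.
rho0 <- 0.5
Z <- (z - atanh(rho0)) / sigma_z
pvalue <- 2 * (1 - pnorm(abs(Z)))

print(rho_ci)
print(cor.test(x, y)$conf.int)   # should agree with rho_ci
print(pvalue)
```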

Simple Linear Regression

R Code:
> model = lm(Y~X)
> summary(model)
> confint(model,level=0.95)
> newx = data.frame(X=c(a,b,c..)) #column name must match the predictor in lm()
> pred_conf = predict(model,newx,level=0.95, interval="confidence")
> pred = predict(model,newx,level=0.95, interval="prediction")...
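A minimal runnable version of the regression workflow on made-up data. The new X values are hypothetical. The point of the sketch is that the prediction interval for a new observation is wider than the confidence interval for the mean response at the same X.

```r
# Hypothetical data, roughly linear.
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

model <- lm(Y ~ X)
print(coef(model))
print(confint(model, level = 0.95))   # CIs for intercept and slope

# The column name in newx must match the predictor name used in lm().
newx <- data.frame(X = c(2.5, 9))

pred_conf <- predict(model, newx, level = 0.95, interval = "confidence")  # mean response
pred      <- predict(model, newx, level = 0.95, interval = "prediction")  # new observation

print(pred_conf)
print(pred)
```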