Title | MH3511 Finals Cheatsheet |
---|---|
Course | Data Analysis w Comp |
Institution | Nanyang Technological University |
Pages | 2 |
MH3511 Cheatsheet AY20/21
Chap1 – Intro to R
> seq(2,10,2); seq(2,10,length=3)
[1] 2 4 6 8 10; 2 6 10
> rep(2,3)
[1] 2 2 2
Vectors
> matrix(data,byrow = F, nrow=value)
> Matrix Multiplication: %*%
> Add column/row: cbind()/rbind()
> Remove row/column: matrix_name[-r,-c]
> Transpose: t(); Inverse: solve(); Determinant: det()
Data Frames
> x=data.frame(data1,data2,..)
> Column names: names(x)
> Row names: row.names(x)
> subset(x, condition1|condition2…) #subset
> x[order(x$variable,decreasing=F),]
> AND:"&", OR:"|", NOT:"!"
Conditional Statements
> if(condition){expr} else if(condition){expr} else{expr}
> while(condition){expr1;expr2…} #expr can be an if statement
> for(var in seq){expr} #var in seq: (elem in x),(i in 1:10)
Functions
> fun = function(var1,var2..){body}
Chap2 – Describing Numerical Data
Trimmed Mean: Removes a proportion of the largest and smallest observations and calculates the mean of the remaining observations.
> mean(data,trim=0.10) #10% from each end
1. Establish a relatively low SE of the data
2. Remove effects of possible outliers
Summary: summary(data);
> quantile(data,0.25) #25th percentile
Graphical Methods
> Stem and Leaf: stem(data)
> Histogram: hist(data,breaks=n)
Imposing Normal PDF on histogram
> xpt=seq(a,b,by=0.1) #(a,b) dependent on range
> n_den=dnorm(xpt,mean(data),sd(data))
> ypt=n_den*length(data)*w #w is the width of each bin
> lines(xpt,ypt,col="blue")
QQplot
1. Compare data with known distribution
2. Compare 2 samples
3. Compare 2 known distributions
> qqnorm(data)
> qqline(data,col="blue")
Normally distributed data will lie close to the qqline.
Long/short left/right tails can be identified.
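The normal-curve overlay from Chap2 can be sketched as follows. The simulated `data` and the choice of `breaks = 10` are assumptions for illustration only; the point is that the density is scaled by the sample size times the bin width so that it matches the count scale of the histogram.

```r
set.seed(1)
data <- rnorm(200, mean = 50, sd = 5)   # hypothetical sample

h <- hist(data, breaks = 10)            # histogram with counts on the y-axis
w <- diff(h$breaks)[1]                  # width of each (equal-width) bin

xpt   <- seq(min(data), max(data), by = 0.1)
n_den <- dnorm(xpt, mean(data), sd(data))
ypt   <- n_den * length(data) * w       # rescale density to expected counts
lines(xpt, ypt, col = "blue")

tm <- mean(data, trim = 0.10)           # 10% trimmed from each end
```

If the histogram were drawn with `freq = FALSE`, the rescaling would not be needed and `n_den` could be plotted directly.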
Shapiro-Wilk Test
> shapiro.test(data)
H0: Data is normally distributed
H1: Data is not normally distributed
Boxplot
Chap4 – Categorical Data
Binomial Dist X~B(n,p)
Find $p = X/n$
> b_pdf=dbinom(x,n,p)
> exp_freq=N*b_pdf #N: total sample size
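A small worked example of the expected-frequency calculation above. The observed counts are made up, and estimating p as the mean number of successes divided by n is an assumption about how the sheet intends p to be found.

```r
# Hypothetical data: successes out of n = 4 trials, recorded for N units
n   <- 4
x   <- 0:n
obs <- c(10, 30, 35, 20, 5)          # observed counts for x = 0, 1, 2, 3, 4
N   <- sum(obs)                      # total sample size

p_hat <- sum(x * obs) / (N * n)      # assumed estimate: mean successes / n

b_pdf    <- dbinom(x, n, p_hat)      # binomial probabilities
exp_freq <- N * b_pdf                # expected counts under B(n, p_hat)
cbind(x, obs, exp_freq)
```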
F-test for $\sigma_1^2 = \sigma_2^2$ against $\sigma_1^2 \neq \sigma_2^2$
$F^* = S_1^2/S_2^2 \sim F(n_1-1,\, n_2-1)$
H0: $\sigma_1^2 = \sigma_2^2$, H1: $\sigma_1^2 \neq \sigma_2^2$
Reject H0 if:
> var.test(data1,data2)
2 Way Contingency Tables
E[i,j]=(Total row i / Total)*Column total j
> boxplot(data)
Outlier
1. $|x_i - \bar{x}| > 2 \times sd$
2. $x_i < q_1 - 1.5\,IQR$ or $x_i > q_3 + 1.5\,IQR$
R code:
> abs(data-mean(data))>2*sd(data)
or
> (data<quantile(data,0.25)-1.5*IQR(data)) | (data>quantile(data,0.75)+1.5*IQR(data))
Chap3 – Statistical Inference
Confidence interval: $\bar{X} - z_{a/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{a/2}\,\frac{\sigma}{\sqrt{n}}$
Z0.05 = 1.645, Z0.025 = 1.96, Z0.01 = 2.326
Confidence interval: $\bar{X} - t_{a/2,\,n-1}\,\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{a/2,\,n-1}\,\frac{s}{\sqrt{n}}$
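The t-based confidence interval above can be checked by hand against `t.test()`. This is a minimal sketch on simulated data; the sample and the level `a = 0.10` are assumptions.

```r
set.seed(2)
data <- rnorm(25, mean = 10, sd = 2)   # hypothetical sample
n <- length(data)
a <- 0.10                              # 90% confidence level

tcrit <- qt(1 - a/2, df = n - 1)       # t_{a/2, n-1}
ci    <- mean(data) + c(-1, 1) * tcrit * sd(data) / sqrt(n)

ci_r <- t.test(data, conf.level = 1 - a)$conf.int   # same interval from R
ci
ci_r
```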
t-test for $\mu_a = \mu_b$ against $\mu_a \neq \mu_b$
1. Conduct F test to conclude difference in variance
2. Conduct appropriate t-test accordingly
H0: No association between row and col variables
H1: Row and Col variables are associated
> chisq.test(data)
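To connect the expected-count formula E[i,j] with `chisq.test()`, here is a sketch on a made-up 2x2 table. The counts and labels are hypothetical, and the continuity correction is switched off so that the hand-computed statistic matches.

```r
# Hypothetical 2x2 table of counts
tab <- matrix(c(20, 30,
                25, 25), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

# E[i,j] = (row total i / grand total) * column total j
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)

stat <- sum((tab - E)^2 / E)           # chi-square statistic by hand
pval <- 1 - pchisq(stat, df = 1)       # df = (rows-1)*(cols-1) = 1

res <- chisq.test(tab, correct = FALSE)
res$expected                           # matches E
```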
> t.test(data1,data2,var.equal = T) #or F depending
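The two-step procedure (F test first, then the matching t-test) might look like this in practice. The simulated samples and the 0.05 cut-off for the F test are assumptions, not part of the sheet.

```r
set.seed(3)
data1 <- rnorm(30, mean = 10, sd = 2)   # hypothetical sample 1
data2 <- rnorm(30, mean = 11, sd = 2)   # hypothetical sample 2

# Step 1: F test for equality of variances
ft <- var.test(data1, data2)
equal_var <- ft$p.value > 0.05          # assumed 5% significance level

# Step 2: pooled t-test if variances look equal, Welch t-test otherwise
tt <- t.test(data1, data2, var.equal = equal_var)
tt$p.value
```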
Paired 2 Way Contingency Tables
Two Dependent Samples Sample: {(X11,X21),(X12,X22)..}
$D_i = X_{1i} - X_{2i}$,
R code:
> t.test(data, conf.level=0.9) #mu=0 default
> t.test(data, conf.level=0.9,mu=a) #H1: true mean is not equal to a
One sided CI: $\mu < \bar{X} + z_{a}\,\frac{\sigma}{\sqrt{n}}$
> t.test(data, conf.level=0.9, alt="less")
alt: "two.sided"(Default),"less","greater"
Proportion
Chi-square test applied as above, v=1
Chi-square with continuity correction:
R code:
> pattable = table(data$before,data$after)
> mcnemar.test(pattable)
H0: Treatment has not affected proportion of positives
H1: Treatment has affected proportion of positives
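A sketch of the McNemar test on made-up before/after data. The vectors `before` and `after` are hypothetical; the hand calculation uses the usual continuity-corrected form based on the two discordant cells, which is what `mcnemar.test()` applies by default.

```r
# Hypothetical paired outcomes for 100 subjects
before <- c(rep("pos", 40), rep("neg", 60))
after  <- c(rep("pos", 25), rep("neg", 15),   # the 40 who were "pos" before
            rep("pos", 10), rep("neg", 50))   # the 60 who were "neg" before

pattable <- table(before, after)
pattable

b <- pattable["neg", "pos"]                  # changed neg -> pos
c2 <- pattable["pos", "neg"]                 # changed pos -> neg
stat <- (abs(b - c2) - 1)^2 / (b + c2)       # continuity-corrected statistic

res <- mcnemar.test(pattable)                # correct = TRUE by default
res$p.value
```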
> z=qnorm(1-a/2)
If H0 holds true
, where v=n-1
-Paired t-test
> t.test(data$var1,data$var2,mu=0,paired = T)
or
> d=data$var1 - data$var2
> t.test(d) #one sample t-test
Multiple(>2) Samples
Assumptions:
- All sample observations are independent
- Population variances are the same
Chap5 – Multiple Samples Data
When variances are unknown but known to be equal, $\sigma_1^2 = \sigma_2^2$:
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$, $v = n_1 + n_2 - 2$
> F = qf(a/2,v1,v2)
R code for Proportion:
> prop.test(x,n,conf.level=0.90) #default correct=T
Hypo testing for Proportion:
> prop.test(x,n,p0,alt="") #H1: true p is "" than p0
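The pooled variance and its degrees of freedom can be verified numerically against the equal-variance t-test. The samples are simulated, and the t statistic formula used here (mean difference divided by the pooled standard error) is the standard one, stated as an assumption since the sheet does not spell it out.

```r
set.seed(4)
x1 <- rnorm(12, mean = 5, sd = 1)   # hypothetical sample 1
x2 <- rnorm(15, mean = 6, sd = 1)   # hypothetical sample 2
n1 <- length(x1); n2 <- length(x2)

# Pooled variance and degrees of freedom
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
v   <- n1 + n2 - 2

# Pooled two-sample t statistic
t_stat <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1/n1 + 1/n2))

tt <- t.test(x1, x2, var.equal = TRUE)
c(by_hand = t_stat, from_R = unname(tt$statistic))
```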
When variances are unknown and known to be unequal, $\sigma_1^2 \neq \sigma_2^2$,
$F^* = S_b^2/S_w^2$, p-value = 1-pf(F*,k-1,n-k)
H0: There is no difference between "factors" for "variable"
H1: Difference between "factors" for "variable" exists
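The F statistic above can be reproduced by hand and compared with the ANOVA table from `aov()`. The data frame `dat` and its column names are hypothetical; the between and within mean squares are computed with the usual one-way ANOVA definitions.

```r
set.seed(5)
dat <- data.frame(var = c(rnorm(8, 10), rnorm(8, 11), rnorm(8, 12)),
                  grp = rep(c("a", "b", "c"), each = 8))   # hypothetical data
k <- 3                      # number of groups
n <- nrow(dat)              # total number of observations

grand <- mean(dat$var)
means <- tapply(dat$var, dat$grp, mean)
sizes <- tapply(dat$var, dat$grp, length)

Sb2 <- sum(sizes * (means - grand)^2) / (k - 1)      # between-group mean square
Sw2 <- sum((dat$var - means[dat$grp])^2) / (n - k)   # within-group mean square

Fstar <- Sb2 / Sw2
pval  <- 1 - pf(Fstar, k - 1, n - k)

tab <- summary(aov(var ~ factor(grp), data = dat))[[1]]
tab
```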
> friedman.test(var~treatment|block, data=datafrm)
or
> friedman.test(datafrm$var,datafrm$treatment,datafrm$block)
ANOVA Model
> aov(data$var~factor(data$factor))
> summary(aov(data$var~factor(data$factor)))
If H0 is rejected, conduct pairwise comparisons:
> pairwise.t.test(data$variable,data$factor,p.adjust.method = "none") #Significant p-values denote differences between factors
Chap6 – NonParametric Test
Quantile Test
1. Rank observations
2. Let t1 be the number of observations ≤ xp
3. Let t2 be the number of observations < xp
Chap7 – Correlation and Regression
- Correlation coefficient, $\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
p-value = 1-pchisq(T,number of factors -1)
> kruskal.test(data$var,data$factor)
Two Dependent Samples
Prediction
- Hypothesis Testing (H0: ρ=0)
p-value = 2*pt(-abs(t),df) or 2*(1-pf(f,df1,df2))
> cor.test(x,y)
Where n is the number of observations, Y~B(n,p).
R code:
> 2*pbinom(t1,n,p) or 2*(1-pbinom(t2-1,n,p))
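A sketch of the quantile test using the binomial formulas above, here testing whether the median equals a hypothesised value. The data, `xp = 50` and `p = 0.5` are assumptions, and taking the smaller of the two doubled tail probabilities (capped at 1) as the two-sided p-value is one common convention rather than something the sheet states.

```r
set.seed(6)
x  <- round(rnorm(20, mean = 52, sd = 5))   # hypothetical sample
xp <- 50                                    # hypothesised p-th quantile
p  <- 0.5                                   # p = 0.5 tests the median
n  <- length(x)

t1 <- sum(x <= xp)    # number of observations <= xp
t2 <- sum(x <  xp)    # number of observations <  xp

p_low  <- 2 * pbinom(t1, n, p)             # too few observations <= xp
p_high <- 2 * (1 - pbinom(t2 - 1, n, p))   # too many observations < xp
pval   <- min(1, p_low, p_high)            # assumed two-sided convention
pval
```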
- Hypothesis Testing (H0: ρ=ρ0)
Two Independent Samples
- T~B(n,0.5), T0 = n(+ chosen)
p-value = Pr(T≥T0|T~B(n,0.5))
> 1-pbinom(T0-1,n,0.5)
or
> binom.test(T0,n,0.5,alternative="greater")
When n is large,
> prop.test(T0,n,p=0.5,alternative="greater")
Wilcoxon signed ranks test
Within each pair, let Di = Yi – Xi. Let Ri be the rank of |Di|; the rank is given a NEGATIVE sign when Di is negative.
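The sign-test p-value formulas above give the same answer, as this small example shows. The paired values are made up; zero differences are dropped before counting, which is the usual convention and an assumption here.

```r
# Hypothetical paired measurements
x <- c(12, 15, 11, 18, 14, 13, 16, 17, 10, 19)
y <- c(14, 17, 10, 21, 16, 15, 15, 20, 13, 22)

d  <- y - x
d  <- d[d != 0]          # drop ties (assumed convention)
n  <- length(d)
T0 <- sum(d > 0)         # number of positive differences

p_manual <- 1 - pbinom(T0 - 1, n, 0.5)                               # Pr(T >= T0)
p_binom  <- binom.test(T0, n, 0.5, alternative = "greater")$p.value
c(T0 = T0, n = n, p_manual = p_manual, p_binom = p_binom)
```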
pvalue = 2*pnorm(T)
Multiple Independent Samples
Kruskal-Wallis Test
> wilcox.test(y,x,paired=T,correct=T) #or F
Multiple Dependent Samples
^Rank all observations
, df=n-2. CI: Estimating mean response
*Measures strength and direction of linear relationship
$t^* = r/S_r$, df = n-2, or $F = \frac{1+|r|}{1-|r|}$, df = (n-2,n-2)
Wilcoxon Rank Sum Test
H0: Pr(X
> wilcox.test(data1,data2)
or
> wilcox.test(variable~factor, data=my_data, exact=F)
, df=n-2. CI:
1. Rank across row i then sum down column j (Rj)
> pvalue = 2*(1-pnorm(Z))
Confidence Interval: $z \pm Z_{0.025} \times \sigma_z = (z_l, z_u)$
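The z-based p-value and confidence interval above appear to refer to Fisher's z transformation of the sample correlation. The sketch below assumes that reading, with z = atanh(r) and σz = 1/sqrt(n-3), neither of which survives in the sheet; the data and `rho0 = 0.3` are hypothetical.

```r
set.seed(7)
x <- rnorm(30)
y <- 0.6 * x + rnorm(30, sd = 0.8)   # hypothetical correlated data
n <- length(x)

r       <- cor(x, y)
z       <- atanh(r)                  # Fisher z transform (assumed)
sigma_z <- 1 / sqrt(n - 3)           # standard error of z (assumed)

# Test H0: rho = rho0
rho0   <- 0.3
Z      <- (z - atanh(rho0)) / sigma_z
pvalue <- 2 * (1 - pnorm(abs(Z)))

# 95% CI on the z scale, then back-transformed to the correlation scale
z_ci   <- z + c(-1, 1) * qnorm(0.975) * sigma_z
rho_ci <- tanh(z_ci)
rho_ci
```

The back-transformed interval agrees with the one reported by `cor.test(x, y)$conf.int` for the Pearson correlation.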
Simple Linear Regression
R Code:
> model = lm(Y~X)
> summary(model)
> confint(model,level=0.95)
> newx = data.frame(X=c(a,b,c..)) #column name must match the predictor in lm()
> pred_conf = predict(model,newx,level=0.95, interval="confidence")
> pred = predict(model,newx,level=0.95, interval="prediction")...
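A runnable version of the regression workflow above, on simulated data. The values of `X`, `Y` and the new points are assumptions. Note that the column in `newx` carries the same name as the predictor used in `lm()`; otherwise `predict()` cannot match it.

```r
set.seed(8)
X <- 1:20
Y <- 3 + 2 * X + rnorm(20)             # hypothetical linear relationship

model <- lm(Y ~ X)
summary(model)
confint(model, level = 0.95)           # CIs for intercept and slope

newx <- data.frame(X = c(5, 10, 25))   # new predictor values (hypothetical)

# CI for the mean response at each new X
pred_conf <- predict(model, newx, level = 0.95, interval = "confidence")
# Prediction interval for a single new observation at each new X
pred      <- predict(model, newx, level = 0.95, interval = "prediction")

cbind(newx, pred_conf, pred)
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the variability of an individual observation.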