Title | Stats 101 Cheatsheet |
---|---|
Course | Introduction to statistic |
Institution | Singapore Management University |
Pages | 2 |
File Size | 371.9 KB |
CATEGORICAL / QUALITATIVE / DEP (Y)
√ observed (index, blood pressure readings), X measured
Nominal (X order, e.g. sex) VS Ordinal (√ order, e.g. months)

UNIVARIATE CATEGORICAL (1 VARIABLE)
1) Bar Charts
2) Pie Charts (360°)
3) Pareto Diagram (many categories)
   a. Bars arranged highest to lowest
   b. Cumulative percentage polygon (line)
Description: highest, 2nd highest, lowest

BIVARIATE CATEGORICAL (2 VARIABLES) (see HYPO TESTS FOR CATS)
1) Side-by-Side Bar Charts
   a. Easier to compare when all values are v. similar

NUMERICAL / QUANTITATIVE / INDEP (X)
√ measured (height, weight, area, temp)
Discrete (√ countable) VS Continuous (X countable)

UNIVARIATE NUMERICAL (1 VARIABLE)
1) Histogram
   • X-axis = number line, so bars cannot be rearranged
   • Meaningful intervals with equal widths
   • Freq table (Class | Bin = upper limit | Freq): find min, max, range first
   • Width = $\dfrac{\text{range}}{n}$, where n = number of classes (e.g. n = 10); e.g. class 9.5 – 12.0 has bin 12.0
   • Symmetric/uniform: Median = Mean
   • Right-skewed / +ve tailed: Median < Mean
   • Left-skewed / -ve tailed: Median > Mean
2) Box and Whisker Plot / 5-number summary

Arithmetic Mean: $\bar{x} = \dfrac{\Sigma x}{n}$
Median: Q2 (50%); middle value after sorting lowest to highest
Mode: most freq value (can be > 1 mode)
Range = Max - Min
SD = s = reasonable variation = $\sqrt{\dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n-1}}$   *SD is non-negative!!
Interquartile Range (IQR) = Q3 (75%) - Q1 (25%): the middle 50% of X spans a range of (IQR).
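
The summary measures above can be checked numerically. The sketch below is only an illustration: the function name `summarize` and the sample data are made up, and `statistics.quantiles` uses one of several quartile conventions, so Q1/Q3 may differ slightly from a hand method taught in class.

```python
import math
import statistics

def summarize(xs):
    """Summary statistics; SD uses the computational formula from the notes."""
    n = len(xs)
    sum_x = sum(xs)                     # Σx
    sum_x2 = sum(x * x for x in xs)     # Σx²
    mean = sum_x / n
    sd = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
    q1, _q2, q3 = statistics.quantiles(xs, n=4)  # quartile convention is an assumption
    return {
        "mean": mean,
        "median": statistics.median(xs),
        "modes": statistics.multimode(xs),       # can contain more than one mode
        "range": max(xs) - min(xs),
        "sd": sd,
        "iqr": q3 - q1,
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
summary = summarize(data)
```

The shortcut SD formula should agree with `statistics.stdev`, which also divides by n - 1.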

BIVARIATE NUMERICAL (2 VARIABLES) - relationship
#1: Find $\Sigma x$ ; $\Sigma y$ ; $\Sigma x^2$ ; $\Sigma y^2$ ; $\Sigma xy$ ; $n$ (draw table)
#2: let X = ____ (units) ; let Y = ___ (units)
#3: Linear Correlation Coefficient / multiple r (4 dp)
$r = \dfrac{\Sigma xy - \frac{(\Sigma x)(\Sigma y)}{n}}{\sqrt{\left[\Sigma x^2 - \frac{(\Sigma x)^2}{n}\right]\left[\Sigma y^2 - \frac{(\Sigma y)^2}{n}\right]}}$
*Correlation does NOT imply causation!!
Intervening factors: e.g. ↑ go out, ↑ accidents
#4: Interpret r / slope (“trend” / properties)
   r close to -1: V. strong -ve linear correlation btwn X & Y
   0.0-0.2: V. weak, negligible r/s (0 = X sig correlated)
   0.2-0.4: Weak/moderate r/s
   0.4-0.6: Fairly +ve r/s
   0.6-0.8: Strong +ve r/s
   0.8-1.0: V. strong +ve linear correlation btwn X & Y
The scatterplot / line of best fit / regression line / normal prob plot shows a strong +ve / -ve / negligible r/s btwn X & Y, with a good degree of linearity / strong normality of data (X major deviation from plotted points). We can conclude that there is a strong +/- correlation btwn X & Y: the longer the X, the longer the Y. ∴ we are @ liberty to apply the least sqs mtd to find the eqn of the regression line.
#5: Coefficient of Determination, $r^2$ (% / proportion)
Since $r^2$ = __%, __% of the variation in $y$ is explained by the variability of $x$. The remaining (100 - __) = __% of the variability in $y$ is due to factors other than what the linear regression model can explain, i.e. the variability in predictions this model yields is ↓ by $r^2$%.
#6: Linear Regression Eqn (predict): $y = b_1 x + b_0$
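#7: Gradient: $b_1 = \dfrac{\Sigma xy - \frac{(\Sigma x)(\Sigma y)}{n}}{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}$   *regression coefficients
#8: y-intercept: $b_0 = \left(\dfrac{\Sigma y}{n}\right) - b_1\left(\dfrac{\Sigma x}{n}\right)$
*n = number of sample pairs ; X = √ controlled (indep) ; Y = X controlled (dep)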
#9: Write linear eqn in context, e.g. pulse in bpm = -3.5 (time) + 4.5
#10: The sample regression slope, $b_1$, represents the estimated expected ↑/↓ in $y$ per unit ↑ in $x$. In context: for every 1 (unit) ↑ in $x$, $y$ rises (if $b_1 > 0$) or drops (if $b_1 < 0$) by $|b_1|$ (units) on average.
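
Steps #1 to #8 can be traced in a few lines of code. This is a minimal sketch using the sums-based formulas from the notes; `linear_fit` and the five data pairs are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def linear_fit(xs, ys):
    """Return (r, b1, b0) from the sums Σx, Σy, Σx², Σy², Σxy and n."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n                 # Σx² − (Σx)²/n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n                 # Σy² − (Σy)²/n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n  # Σxy − (Σx)(Σy)/n
    r = sxy / math.sqrt(sxx * syy)        # correlation coefficient
    b1 = sxy / sxx                        # gradient
    b0 = sum_y / n - b1 * (sum_x / n)     # y-intercept
    return r, b1, b0

xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values
r, b1, b0 = linear_fit(xs, ys)
r_squared = r ** 2     # proportion of variation in y explained by x
```

For this data the fitted line is y = 0.6x + 2.2 with r² = 0.6, so 60% of the variation in y is explained by x.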
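
CHANCE / PROBABILITY
Classical: √ assumption (equally likely outcomes); qn gives X numbers
   P(2B,1G) = P(BBG) + P(BGB) + P(GBB)
   *P(BBG) = $\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = \left(\tfrac{1}{2}\right)^3$
Empirical: X assumption, “survey”; qn gives √ numbers
   P(2B,1G) = $\tfrac{89}{232}$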
PASCAL’S ∆: $^nC_2$ = 10 combi / ways to select 2 out of n coins (n = 5)
MONTY HALL PARADOX: switch to x2 chances of win: $\tfrac{1}{3}$ (my choice) VS $\tfrac{2}{3}$ (remaining)
SIMPSON’S / ONE PROBABILITY PARADOX: Reversal of effect materialises when 1 factor is omitted/included. $x$ is a lurking variable, one which may exert dramatic influences if omitted from a study, cos an association may look quite diff after adjusting for the effect of this 3rd variable by grouping the data according to its values. ∴ we cannot summarily state that __, but rather the reverse.
GENERAL ADDITION RULE:
P(A or B) = P(A∪B) = P(A) + P(B) – P(A∩B)
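
The Monty Hall figures quoted above (1/3 for staying, 2/3 for switching) can be verified by listing every equally likely (car position, first pick) pair. The sketch below assumes the standard game: three doors, and the host always opens a goat door that is not the player's pick. `monty_hall` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

def monty_hall():
    """Exact win probabilities for staying vs switching."""
    cases = list(product(range(3), repeat=2))   # (car door, first pick), 9 equally likely cases
    stay_wins = switch_wins = 0
    for car, pick in cases:
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        # Switching means taking the one remaining closed door.
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    return Fraction(stay_wins, len(cases)), Fraction(switch_wins, len(cases))

p_stay, p_switch = monty_hall()
```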
MULTIPLICATION RULE (TEST FOR INDEPENDENCE):
• Outcome of 1 event DOES NOT affect probability of another event occurring → multiply!
#1: P(A and B) = P(A∩B) = $\tfrac{35}{70}$
#2: P(A) × P(B) = $\tfrac{35}{70} \times \tfrac{23+33}{70}$
#1 = #2 (OR *P(A|B) = P(A)) → √ indep ; #1 ≠ #2 → dep
BAYES’ RULE:
*$P(A|B) = \dfrac{P(A \cap B)}{P(B)} = \dfrac{P(B|A)\,P(A)}{P(B)} = \dfrac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A')\,P(A')}$
*P(A’) = 1 – P(A) OR P(win) = 1 – P(lose)
*P(F’|W’) = cannot find if/given not there = 1
*P(≥2 share birthdays) = 1 – P(no one shares) = $1 - \left(\tfrac{365}{365} \times \tfrac{364}{365} \times \dots \times \tfrac{316}{365}\right)$
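
The birthday calculation above is a product of shrinking fractions, which is tedious by hand but short in code. This sketch assumes 365 equally likely birthdays; a last factor of 316/365 corresponds to a group of 50 people. `p_shared_birthday` is a hypothetical name.

```python
def p_shared_birthday(n, days=365):
    """P(at least 2 of n people share a birthday) = 1 - P(no one shares)."""
    p_none = 1.0
    for k in range(n):
        # k-th person must avoid the k birthdays already taken
        p_none *= (days - k) / days
    return 1 - p_none

p50 = p_shared_birthday(50)   # factors run from 365/365 down to 316/365
p23 = p_shared_birthday(23)   # the classic "just over half" case
```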

CONTINUOUS DISTRIBUTIONS “CHANCE”: X count (height, weight, time, temp)
NORMAL DIST (pop) / SAMPLE NORM DIST
• Bell VS well curve (retail, hotel, income: avg @ bottom)
• How to det normal behv? See Scatter Plot.

DISCRETE DISTRIBUTIONS “CHANCE”: √ count (number of…)
DISCRETE “expected winnings, true/weighted avg”
#1: let X = player’s winnings in $ ; m = amt paid to play
#2: Table: Outcome (# hits) | X in $ (= win – amt to play) | Prob | xP ; column totals: Prob = 1, xP = *E(X)
• Mean = E(X) = ΣxP   *For a fair game, E(X) = 0
• “Expected Winnings” = win – lose = – $100 is the same as “Expected Loss” = + $100 (opp signs!!)
• SD(X) = √( E(X²) − [E(X)]² )
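
The expected-winnings table translates directly into code. The game below is invented for illustration: the player pays $1 and has a 1-in-10 chance of receiving $10, so the net winnings are +9 or -1. The function names are hypothetical.

```python
import math

def expected_value(dist):
    """E(X) = Σ x·P(x) for a list of (x, probability) pairs."""
    return sum(x * p for x, p in dist)

def standard_deviation(dist):
    """SD(X) = sqrt(E(X²) − [E(X)]²)."""
    ex = expected_value(dist)
    ex2 = sum(x * x * p for x, p in dist)
    return math.sqrt(ex2 - ex ** 2)

# (net winnings in $, probability): hypothetical fair game
game = [(9, 0.1), (-1, 0.9)]
e_x = expected_value(game)
sd_x = standard_deviation(game)
```

Here E(X) comes out at 0, which is the notes' condition for a fair game, while the SD of 3 shows the outcomes still vary widely around that mean.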

BINOMIAL: 2 outcomes only! head/tail, Y/N, M/F (p = 0.5)
• P(success) + P(failure) = 1 ; p = prob of success ; 1 - p = prob of failure
• x = 0 to n (√ limit) ; CONSIDER IF X CAN BE -VE!
• Mean = E(X) = $np$
• SD(X) = $\sqrt{np(1-p)}$ (full dp!)
#1: let X = number of (qn)
#2: X ~ B(n, p)   (n = ∞ p = 100%)
#3: P(X = x) = $\binom{n}{x} p^x (1-p)^{n-x}$, e.g. P(0 ≤ X ≤ 1) = P(X = 0) + P(X = 1)
*$\binom{x}{0} = 1$ ; $\binom{x}{1} = x$ ; $x^0 = 1$
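
A short sketch of step #3, assuming a hypothetical case of n = 10 trials with p = 0.5; `binom_pmf` is an illustrative helper built on the standard-library `math.comb`.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5                                          # hypothetical values
p_0_to_1 = binom_pmf(0, n, p) + binom_pmf(1, n, p)      # P(0 ≤ X ≤ 1)
mean = n * p                                            # E(X) = np
sd = math.sqrt(n * p * (1 - p))                         # SD(X) = sqrt(np(1−p))
```

With these numbers P(0 ≤ X ≤ 1) = (1 + 10)/1024, and the probabilities over x = 0 to n sum to 1.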
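
POISSON: units of measure “per __ / in __ yrs”, ∞ (X limits) **∆ λ to match the unit if necessary!!
• Mean = E(X) = $\lambda$ / unit
• SD(X) = $\sqrt{\lambda}$
#1: let X = number of … every 10 mins/year
#2: X ~ 𝝋($\lambda$)
#3: P(X = x) = $\dfrac{e^{-\lambda}\lambda^x}{x!}$
   P(X ≥ 3) = 1 – P(X = 0, 1, 2) = $1 - e^{-\lambda}\left[\dfrac{\lambda^0}{0!} + \dfrac{\lambda^1}{1!} + \dfrac{\lambda^2}{2!}\right]$
**Poisson (discrete) VS Exponential (continuous; T btwn…)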
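
The complement trick in step #3 looks like this in code. The rate λ = 2 per interval is an assumed value for illustration, and `poisson_pmf` is a hypothetical helper.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0   # hypothetical average count per unit
# P(X ≥ 3) = 1 − [P(X=0) + P(X=1) + P(X=2)]
p_at_least_3 = 1 - sum(poisson_pmf(x, lam) for x in range(3))
```

If the question changes the unit (for example from 10 minutes to 30 minutes), λ has to be rescaled before calling the function.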
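
APPROX BINOMIAL → POISSON “using suitable approx”
#1: let X = number of (qn)
#2: X ~ B(n, p)
#3: Since n = __ (e.g. 4000) is v. large and p = __ is v. small, approximate with Poisson: P(X = x) ≈ $\dfrac{e^{-\lambda}\lambda^x}{x!}$

Given: $\sigma$ / $\dfrac{\sigma}{\sqrt{n}}$ = __ , n = __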
let X = ____ *might have to draw table
Since n ≥ 30 (or nPs ≥ 5 and n(1-Ps) ≥ 5), by CLT,
X ~ N($\mu, \sigma^2$) OR $\bar{x}$ ~ N$\left(\mu, \left(\dfrac{\sigma}{\sqrt{n}}\right)^2\right)$
P(0 ≤ X ≤ 14) = $P\left(\dfrac{0 - \mu}{\sigma} < Z < \dfrac{14 - \mu}{\sigma}\right)$ ; for a sample mean use $Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ (z to 2 dp)
** ≈ 0.5 – 0.5 (if too far off graph) ≈ 0.0000
If X $\sigma$ given: P(-2$\sigma$ < X – $\mu$ < 2$\sigma$) = P(-2 < Z < 2) = see table
**“__% fall within/beyond 2SD above norm/mean?”
#5: Although the population was not described as normal, √ CLT came into effect as √ $\sigma$ was known and the random sample size n = 36 > 30, which means $\bar{x}$ is approx normally distributed. OR A2) CI of $\mu$ if $\sigma$ is unknown.
0.0001% chance of getting at least as extreme a result as this. If the mean were ___g, the result is (not) likely due to random variation / chance. Rather, the mean was likely to be below ___g.
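
Instead of a printed Z table, the standard normal CDF can be computed from the error function. The sketch below assumes made-up values (μ = 50, σ = 12, n = 36); `phi` and `z_for_sample_mean` are hypothetical helper names.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_for_sample_mean(xbar, mu, sigma, n):
    """Z = (x̄ − μ) / (σ/√n), valid when the CLT conditions hold."""
    return (xbar - mu) / (sigma / math.sqrt(n))

p_within_2sd = phi(2) - phi(-2)              # P(−2 < Z < 2)
z = z_for_sample_mean(46, 50, 12, 36)        # hypothetical sample mean of 46
p_below = phi(z)                             # P(x̄ < 46)
```

Here z works out to -2.00, so P(x̄ < 46) is about 2.3%, while roughly 95.4% of values fall within 2 SD of the mean.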

SAMPLE PROPORTION, Ps (√ count!)
Ps = $\dfrac{x}{n}$ (0 ≤ Ps ≤ 1)

  | Mean | SD |
---|---|---|
Population | $\pi$ = true proportion | $\sigma$ = true SD |
Sample | E(Ps) = $\pi$ (when dk $\pi$, use Ps) | $s = \sqrt{\dfrac{\pi(1-\pi)}{n}}$ |

*might have to draw table
#1: $\pi$ / Ps = __ (%) , n = __
#2: let Ps = sample proportion of ___
#3: Since n$\pi$ (or nPs) = __ ≥ 5 and n(1-$\pi$) (or n(1-Ps)) = __ ≥ 5, by CLT, Ps ~ N$\left(\pi, \left(\sqrt{\dfrac{\pi(1-\pi)}{n}}\right)^2\right)$
#4: P(Ps > 0.74) = $P\left(Z > \dfrac{Ps - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}\right)$   *if qn didn’t state the direction: 0.74 > 0.5, so use >
#5: __% chance of … (>/< 5%? see table for descrip)
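
Steps #3 and #4 can be sketched as follows, assuming a hypothetical true proportion π = 0.70 and sample size n = 100. The helper names are illustrative, and the normal CDF is computed with `math.erf` rather than read from a table.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def clt_ok(pi, n):
    """Check n·π ≥ 5 and n·(1 − π) ≥ 5 before using the normal approximation."""
    return n * pi >= 5 and n * (1 - pi) >= 5

def p_sample_prop_above(ps, pi, n):
    """P(Ps > ps) using Ps ~ N(π, π(1 − π)/n)."""
    sd = math.sqrt(pi * (1 - pi) / n)
    z = (ps - pi) / sd
    return 1 - phi(z)

pi, n = 0.70, 100                                # hypothetical values
conditions_met = clt_ok(pi, n)
p_above = p_sample_prop_above(0.74, pi, n)       # P(Ps > 0.74)
```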

Poisson approx to binomial: X ~ 𝝋($\lambda$) **$\lambda = np$ = mean
#3: P(X ≤ 2) = P(X = 0, 1, 2) ≈ $e^{-\lambda}\left[\dfrac{\lambda^0}{0!} + \dfrac{\lambda^1}{1!} + \dfrac{\lambda^2}{2!}\right]$   ($x^0 = 1$ ; $0! = 1$)
Given: $\mu$ / $\bar{x}$* = __ , $\sigma$ = __

*KEYWORDS: within / no more than / @most → ≤ ; no less than → ≥
*SMALL N: skewed distribution which violates the necessary assumption of normality.

PROBABILITY / P-VALUE
>/< lowest sig lvl commonly employed?

p-value < 5% or 1% (reject H0) | p-value > 5% or 1% |
---|---|
Stat SIG diff (esp life & death) | Statistically INsig |
Small, v. rare occurrence | Likely occurrence |
√ statistically unusual, extreme | X statistically unusual |
X due to random variation/chance | √ due to random variation/chance |
Beyond reasonable doubt | Very common |

UNIFORM “symmetrical/rectangle/equal probability”: intervals of equal length (alarm every 5 min)
Mean = E(X) = $\dfrac{a+b}{2}$ , SD(X) = $\sqrt{\dfrac{(b-a)^2}{12}}$
#1: let X = ____
#2: X ~ Uni(a, b)
#3: P(X < 20) = length × height (area), where height = $\dfrac{1}{b-a}$
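
The "length × height" rule can be written as a small function. This sketch assumes a hypothetical alarm that goes off at a uniformly random time within a 5-minute window (a = 0, b = 5); `uniform_prob_below` is an illustrative name.

```python
import math

def uniform_prob_below(x, a, b):
    """P(X < x) for X ~ Uni(a, b): length (x − a) times height 1/(b − a)."""
    length = min(max(x, a), b) - a     # clamp x into [a, b]
    height = 1 / (b - a)
    return length * height

a, b = 0.0, 5.0                        # hypothetical interval in minutes
mean = (a + b) / 2
sd = math.sqrt((b - a) ** 2 / 12)
p_below_2 = uniform_prob_below(2, a, b)
```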

#7: Test: P(Z >/< #4) = P(Z > __ (2 dp)) = p-value (4 dp) < $\alpha$?
   2-tailed: P(Z > |#4|) = p-value < $\dfrac{\alpha}{2}$?
#8: ∴ We do not reject (p-value > $\alpha$ or $\dfrac{\alpha}{2}$) / reject (p-value < $\alpha$ or $\dfrac{\alpha}{2}$) H0.
#9: We do not have / have enough evidence, at __% sig lvl, to say that H0 / H1.
Z table columns: $\alpha$ sig lvl (2-tailed) | C.I. = 100(1-$\alpha$)% | $Z_{\alpha/2}$
#5: Limit = (lower [mean - SD] , upper [mean + SD])

EXPONENTIAL “waiting time (T)”
#1: Mean = E(X) = SD(X) = E(T) = $\dfrac{1}{\lambda}$
#4: *P(T > arrival time t) = $e^{-\lambda t}$
   *P(T ≤ 10) = 1 – P(T > 10) = $1 - e^{-\lambda t}$ (with t = 10)
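
A minimal sketch of the two exponential probabilities, assuming a hypothetical rate of λ = 0.2 arrivals per minute (so the mean wait is 1/λ = 5 minutes). The function name is illustrative.

```python
import math

def p_wait_longer_than(t, lam):
    """P(T > t) = e^(−λt) for an exponential waiting time."""
    return math.exp(-lam * t)

lam = 0.2                                   # hypothetical Poisson rate per minute
mean_wait = 1 / lam                         # E(T) = SD(T) = 1/λ
p_more_than_10 = p_wait_longer_than(10, lam)
p_at_most_10 = 1 - p_more_than_10           # P(T ≤ 10)
```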

FINDING n, e, C.I., SIG LVL $\alpha$
Since n ≥ 30 / nPs ≥ 5 and n(1-Ps) ≥ 5, by CLT, $\bar{x}$ / Ps ~ N.

MEAN
$n = \dfrac{(Z_{\alpha/2})^2 \times \sigma^2}{e^2}$ ; $e = Z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$ “within ±5”
*If $\sigma$ unknown, estimate from past studies/surveys

PROPORTION, Ps: true %, √ count (study, vote, poll)
$n = \dfrac{(Z_{\alpha/2})^2 \times Ps(1 - Ps)}{e^2}$ ; $e = Z_{\alpha/2}\sqrt{\dfrac{Ps(1 - Ps)}{n}}$
*If Ps unknown, assume Ps = 0.5 (max value of Ps(1 - Ps))
↑e, ↓n = allow/tolerate more error for less work
e has more effect on error than on CI, bcos $e^2$

Z table row: $\alpha$ = 0.10 | 90% C.I. | $Z_{\alpha/2}$ = 1.645

EXPONENTIAL “mean T btwn…”: $\lambda$ = poisson avg “mean/avg per unit”
#2: let T = time between ____ in sec/mins/hours
#3: T ~ exp($\lambda$)

HYPO TEST: #6: Under H0 , ___
X guilty ≠ innocent ; X guilty = not guilty, not “innocent” (insuff evi to prove innocence)
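
The two sample-size formulas can be sketched as below. The inputs (σ = 15, e = 5, e = 0.05) are invented for illustration, the function names are hypothetical, and the result is rounded up because n must be a whole number large enough to meet the error target.

```python
import math

def n_for_mean(z, sigma, e):
    """n = (Z·σ / e)², rounded up."""
    return math.ceil((z * sigma / e) ** 2)

def n_for_proportion(z, e, ps=0.5):
    """n = Z²·Ps(1 − Ps) / e², rounded up; Ps = 0.5 is the worst case."""
    return math.ceil(z ** 2 * ps * (1 - ps) / e ** 2)

n_mean = n_for_mean(1.96, sigma=15, e=5)      # 95% confidence, within ±5
n_prop = n_for_proportion(1.96, e=0.05)       # 95% confidence, within ±5 percentage points
```

Halving e roughly quadruples n, which is why tolerating more error means less work.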
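
$\bar{x}$ = ___ , $\sigma$ / s = __ , n = __
let $\mu$ = true mean ___
#1: H0 : $\mu$ ≤/≥/= __ (e.g. “at least 0.4” gives H0 : $\mu$ ≥ 0.4, H1 : $\mu$ < 0.4)
#2: H1 : $\mu$ >/</≠ __
…
#5: Reject H0 if P(Z >/< z) < $\alpha$ OR if P(Z > |z|) < $\dfrac{\alpha}{2}$ (2-tailed)
   OR Reject H0 if t >/< $t_{\alpha;\,n-1}$ = critical t (see table!!)
   Reject H0 if |t| > $t_{\alpha/2;\,n-1}$ = see table (2-tailed)
Z table row: $\alpha$ = 0.01 | 99% C.I. | $Z_{\alpha/2}$ = 2.576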
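
• C.I. of $\mu$ when √ $\sigma$ known (pop) / X $\sigma$ UNknown (s)
let $\mu$ = true mean ___
√ $\sigma$ known: Since n ≥ 30, by CLT, $\bar{x}$ ~ normal,
a __% C.I.* for $\mu$ = $\left(\bar{x} \pm Z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right)$ = (– , +) units (4 dp)   *__% = confidence coefficient
X $\sigma$ UNknown: Assuming the distribution is not too skewed & since $\sigma$ is unknown,
a __% C.I.* for $\mu$ = $\left(\bar{x} \pm t_{\alpha/2;\,n-1}\left(\dfrac{s}{\sqrt{n}}\right)\right)$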
** Conclusion: draw intervals!
1) We are __% confident that the true mean of ($\mu$) lies btwn __ and __ (units).
2) Valid claim? 0 falls within CI = X sig diff btwn X & Y
3) X overlap: Since the lower confi limit of X exceeds the upper limit of Y, the true mean/prop of X is significantly bigger than that of Y, with __% confidence.
4) √ overlap: Since the C.I.s of X & Y overlap, cannot claim that 1 has more; no sig diff btwn the 2
5) lower Y = upper X → equal chances of winning
6) “> chance can explain?” → lower limit exceeds 50%
7) C.I. = net that encompasses $\mu$
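
A minimal sketch of the σ-known interval, using the $Z_{\alpha/2}$ values listed in the notes' Z table. The sample figures (x̄ = 100, σ = 15, n = 36) are hypothetical, as are the names `Z_HALF_ALPHA` and `z_interval`.

```python
import math

# Two-tailed critical values from the cheatsheet's Z table: α -> Z_{α/2}
Z_HALF_ALPHA = {0.10: 1.645, 0.05: 1.96, 0.02: 2.326, 0.01: 2.576}

def z_interval(xbar, sigma, n, alpha=0.05):
    """C.I. for μ when σ is known: x̄ ± Z_{α/2}·σ/√n."""
    margin = Z_HALF_ALPHA[alpha] * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

low, high = z_interval(xbar=100, sigma=15, n=36)                 # 95% C.I.
low99, high99 = z_interval(xbar=100, sigma=15, n=36, alpha=0.01)  # 99% C.I.
```

The 99% interval is wider than the 95% one: more confidence needs a bigger "net". When σ is unknown, the same shape applies with s and a critical t value from a t table instead.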
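
ONE-SAMPLE HYPO TESTS “SUFFICIENT EVI?”
MEANS
• √ $\sigma$ known (pop) OR X $\sigma$ UNknown (sample)
Z table rows: $\alpha$ = 0.02 | 98% C.I. | $Z_{\alpha/2}$ = 2.326 ; $\alpha$ = 0.05 | 95% C.I. | $Z_{\alpha/2}$ = 1.96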

PROPORTIONS (Ps): 1 SAMPLE / 2 SAMPLE
“Evi of a diff btwn true PROP of __”
Ps = 6 of 10, once every 7 days, n = _ OR $Ps_1$, $Ps_2$, $n_1$, $n_2$
let $\pi$ = true % / proportion of ___ ; for 2 samples: $\pi = \dfrac{x_1 + x_2}{n_1 + n_2}$
#1: H0 : $\pi$ = __ OR H0 : $\pi_1 = \pi_2$
#2: H1 : $\pi$ >/</≠ __
…
Since n > 30, by CLT / Assuming the distribution is not too skewed,
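$Z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$
• __% C.I.* for $\mu_1 - \mu_2$ = $\left((\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\right)$
• X $\sigma_1$ & $\sigma_2$ UNknown (F-test → T-tests)
1) F-test: “test equality/assumption of variances”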
let $\sigma_1^2$ = true variance of ___ in (units) ; $\sigma_2^2$ = ___
#1: H0 : $\sigma_1^2 = \sigma_2^2$ (not UNequal)
#2: H1 : $\sigma_1^2 \neq \sigma_2^2$ (not equal, sig diff)
#3: $\alpha$ = 0.05 (2-tailed since ≠, X direction)
#4: Assuming the populations are approx. N, $F = \dfrac{S_1^2}{S_2^2} \sim F_{n_1-1,\,n_2-1}$
#5: Reject H0 if F > $F_{\alpha/2;\,n_1-1,\,n_2-1}$ = critical F (see table!)
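2) T-tests: “2 better” = lesser error
let $\mu_1$ = true mean (error) of __ in (units) ; $\mu_2$ = _
…#3: $\alpha$ = 0.05 (1-tailed) *See F-test: $\alpha$ 1/2 tailed? ÷2?
#4: Since $n_1 = n_2 > 30$, by CLT / Assuming the distributions are not too skewed, & since $\sigma_1$ and $\sigma_2$ are unknown & not UNequal / not equal:
   F-test do not reject → $s_p^2$ ; pooled SD / not UNequal variance:
   $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t_{n_1 + n_2 - 2}$ , where $s_p^2 = \dfrac{S_1^2(n_1 - 1) + S_2^2(n_2 - 1)}{n_1 + n_2 - 2}$
   F-test reject → $s_1^2$, $s_2^2$ ; separate SD / not equal variance:
   $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim t_{d.f.}$
#5: Reject H0 if t > $t_{\alpha;\,n_1 + n_2 - 2}$ = critical t (4 dp), table!! **2-tailed: |t| > $t_{\alpha/2;\,n_1 + n_2 - 2}$
#6: Under H0 , $\mu_1 = \mu_2$
#7: Test: #4 = __ >/< #5? (4 dp)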
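
The pooled-variance route (F-test does not reject equal variances) can be sketched as below. The summary statistics are invented, the function names are hypothetical, and the critical F and t values still have to come from tables, which this sketch does not reproduce.

```python
import math

def pooled_variance(s1, n1, s2, n2):
    """sp² = [S1²(n1 − 1) + S2²(n2 − 1)] / (n1 + n2 − 2)."""
    return (s1 ** 2 * (n1 - 1) + s2 ** 2 * (n2 - 1)) / (n1 + n2 - 2)

def pooled_t(xbar1, s1, n1, xbar2, s2, n2, diff_h0=0.0):
    """t statistic with n1 + n2 − 2 degrees of freedom; diff_h0 is μ1 − μ2 under H0."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return ((xbar1 - xbar2) - diff_h0) / se

# Hypothetical samples: (mean, SD, n)
f_stat = 4 ** 2 / 5 ** 2                    # F = S1²/S2², compare with critical F
sp2 = pooled_variance(4, 10, 5, 12)
t_stat = pooled_t(20, 4, 10, 17, 5, 12)     # compare with critical t, df = 20
```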
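
• C.I. of diff $\mu$ when X $\sigma_1$ & $\sigma_2$ UNknown   *Previous $\alpha$ 1/2 tailed?
a __% C.I.* for $\mu_1 - \mu_2$ = $\left((\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2;\,n_1 + n_2 - 2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}\right)$ (pooled) ; for separate variances use $\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

F-test: “Significant F”
#6: Under H0 , $\sigma_1^2 = \sigma_2^2$
#7: Test: F = #4 = __ ; p-value (4 dp) >/< #5?
#8: ∴ We do not reject (< #5) / reject (> #5) H0
#9: We do not have / have enough evi, at __% sig lvl, to say true variances $\sigma_1^2$ & $\sigma_2^2$ are H0 / H1 (see #1 , #2)

• Differences D = X – Y
Find $\Sigma D$ ; $\Sigma D^2$ ; $n_D$ ; $\bar{D}$ ; $s_D$   *Assume X-Y (higher) = -ve
let $\mu_D$ = true mean “X-Y” difference in units.
…#4: … $t = \dfrac{\bar{D} - \mu_D}{\left(\dfrac{s_D}{\sqrt{n_D}}\right)} \sim t_{n_D - 1}$   *refer to ONE SAMPLE HYPO TEST!
• C.I. when X $\sigma_D$ UNknown “C.I. of true mean diff”
a __% C.I.* for $\mu_D$ = $\left(\bar{D} \pm t_{\alpha/2;\,n_D - 1}\left(\dfrac{s_D}{\sqrt{n_D}}\right)\right)$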
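
For the difference-based test, everything reduces to a one-sample t on the D values. The sketch below uses five invented (X, Y) pairs; `paired_t` is a hypothetical name, and the critical t must still be looked up with $n_D - 1$ degrees of freedom.

```python
import math

def paired_t(xs, ys, mu_d=0.0):
    """t = (D̄ − μ_D) / (s_D/√n_D), with D = X − Y for each pair."""
    ds = [x - y for x, y in zip(xs, ys)]
    n = len(ds)
    sum_d = sum(ds)                         # ΣD
    sum_d2 = sum(d * d for d in ds)         # ΣD²
    d_bar = sum_d / n
    s_d = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))
    t = (d_bar - mu_d) / (s_d / math.sqrt(n))
    return d_bar, s_d, t

xs = [10, 12, 9, 11, 13]    # hypothetical "X" measurements
ys = [8, 11, 10, 9, 10]     # hypothetical "Y" measurements
d_bar, s_d, t_stat = paired_t(xs, ys)
```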

HYPO TESTS FOR CATS “SIG LVL, EVI”

$\chi^2$ GOODNESS OF FIT TEST (1 variable)
#1: H0 : ___ ok/fine/unbiased
#2: H1 : ___ NOT ok/NOT fine/biased “disobey law”
…#4.1: Table, e.g. Roulette: American (38, +2G) VS French/EU (37, +1G)

Outcomes | O (A/F) | Prob | E = Prob × n | $\dfrac{(O - E)^2}{E}$ |
---|---|---|---|---|
Red | Y | 18/38 or 18/37 | | Cal “A” |
Black | Y | 18/38 or 18/37 | | Cal “B” |
Green | Y | 2/38 or 1/37 | | Cal “C” |
Total | n | 1 | | “A” + “B” + “C” |

“sum of diff”: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$
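
The roulette table can be filled in programmatically. The observed counts below are invented (380 spins of an American wheel); only the probabilities 18/38, 18/38 and 2/38 come from the wheel layout in the notes. `chi_square_gof` is a hypothetical name.

```python
def chi_square_gof(observed, probs):
    """χ² = Σ (O − E)²/E with E = Prob × n; returns (statistic, expected counts)."""
    n = sum(observed)
    expected = [p * n for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

observed = [170, 190, 20]               # hypothetical counts: red, black, green
probs = [18 / 38, 18 / 38, 2 / 38]      # American wheel
chi2, expected = chi_square_gof(observed, probs)
df = len(observed) - 1                  # categories − 1, for the table lookup
```

Here χ² is about 1.11 on 2 degrees of freedom, to be compared against the critical value from a χ² table.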

$\chi^2$ TEST OF INDEPENDENCE (2 variables)
#1: H0 : X & Y are independent, i.e. no r/s (X evi of r/s)
#2: H1 : X & Y are dep, i.e. ∃ r/s (X direction, only related)
…#4.1: 1) Add on to qn: Row TT, Column TT, Grand TT
2) For each column c1, c2: record O, *E and $\dfrac{(O - E)^2}{E}$ per row

  | c1: $\dfrac{(O - E)^2}{E}$ | c2: $\dfrac{(O - E)^2}{E}$ |
---|---|---|
r1 | “A” | “C” |
r2 | “B” | “D” |
cTT | Q ; #7: “X” | W ; #7: “Y” |

*E = (row TT × column TT) / grand TT (n)
#4.2: Since each E > 5, & sample size is large, $\chi^2 = \sum \dfrac{(O - E)^2}{E} \sim \chi^2_{(r-1)(c-1)}$
#5: Reject H0 if $\chi^2 > \chi^2_{(r-1)(c-1)}$ = critical $\chi^2$ (see table!!)
#6: Under H0 , independence
#7: Test: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$ = “X” + “Y” in table >/< #5?...
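
The expected-count rule and the test statistic for a contingency table can be sketched as follows. The 2 × 2 counts are invented and `chi_square_independence` is a hypothetical name; the critical value still comes from a χ² table with (r - 1)(c - 1) degrees of freedom.

```python
def chi_square_independence(table):
    """Return (χ², expected counts, df) for a table of observed counts (list of rows)."""
    row_tt = [sum(row) for row in table]
    col_tt = [sum(col) for col in zip(*table)]
    grand_tt = sum(row_tt)
    # E = (row TT × column TT) / grand TT
    expected = [[r * c / grand_tt for c in col_tt] for r in row_tt]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_tt) - 1) * (len(col_tt) - 1)
    return stat, expected, df

observed = [[20, 30],
            [30, 20]]       # hypothetical counts
chi2, expected, df = chi_square_independence(observed)
all_e_above_5 = all(e > 5 for row in expected for e in row)   # condition in #4.2
```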