STATISTICS (30001)
Course: Statistica / Statistics
Institution: Università Commerciale Luigi Bocconi


DESCRIPTIVE STATISTICS

VARIABLE TYPES
Categorical nominal: non-numerical, non-rankable
Categorical ordinal: non-numerical, rankable
Numerical discrete: countable
Numerical continuous: can take any value in an interval

MEASURING CENTRAL TENDENCY
Mode (for categorical nominal variables, or variables with few distinct values): the value with the highest frequency or count is the mode, visible in the frequency distribution table.
Median (for categorical ordinal or numerical variables): splits the sample in half.
• from raw data: if n is ODD, the median is the value in position (n+1)/2; if n is EVEN, the median is either of the two middle values or their average
• from cumulative frequencies: the median is the first value whose cumulative frequency reaches or exceeds 0.5
• from a histogram: read the relative frequency of each bar and proceed as with cumulative frequencies
Mean (for numerical variables only): the arithmetic average $\bar{x} = \frac{x_1 + \dots + x_n}{n}$
The deviation is the difference between a value and the mean; it is positive if x > mean and negative if x < mean.
Outliers are extremely high or low values with very low frequency: they affect the mean but not the median.

MEASURES OF LOCATION (used for categorical ordinal and numerical variables)
Quartiles: similar to the median, they split the sample into four parts:
• Q1, first (lower) quartile: the first value whose cumulative frequency reaches or exceeds 0.25
• Q2, the median: the first value whose cumulative frequency reaches or exceeds 0.5
• Q3, third (upper) quartile: the first value whose cumulative frequency reaches or exceeds 0.75
Percentiles are defined like quartiles, but for any percentage one chooses.
Five-number summary: minimum, first quartile, median, third quartile, maximum.
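As a quick illustration, here is a minimal base R sketch of these measures on a made-up numeric vector x (the data are invented for the example; R's quantile() uses its own default interpolation rule, which can differ slightly from the table-based rule above):

x <- c(2, 4, 4, 5, 7, 9, 12, 41)          # toy sample; 41 acts as an outlier
mean(x)                                    # arithmetic average, pulled up by the outlier
median(x)                                  # splits the sample in half, robust to the outlier
quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, median, Q3
fivenum(x)                                 # five-number summary: min, Q1, median, Q3, max
names(which.max(table(x)))                 # mode: the value with the highest count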

MEASURES OF VARIABILITY (used for numerical variables)
Range: $\text{maximum} - \text{minimum}$
Interquartile range: $Q_3 - Q_1$
Variance: the variability around the mean, using deviations:
$s^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$
Standard deviation: the square root of the variance, $s = \sqrt{s^2}$
Variables with the same mean can be compared using the variance and the standard deviation; without the same mean, they can only be compared using the coefficient of variation $CV = \frac{s}{|\bar{x}|}$
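Continuing the sketch above (same vector x), the variability measures in base R; note var() and sd() use the n − 1 denominator, matching the sample variance formula:

max(x) - min(x)        # range
IQR(x)                 # interquartile range Q3 - Q1
var(x)                 # sample variance s^2 (denominator n - 1)
sd(x)                  # standard deviation s = sqrt(s^2)
sd(x) / abs(mean(x))   # coefficient of variation CV = s / |x-bar|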

TOOLS TO ANALYZE VARIABLES
Frequency distribution table (a base R sketch of the table and the charts follows the list below):

Value  Count  Relative freq.  Cumulative freq.
A      x_A    x_A / n         x_A / n
B      x_B    x_B / n         x_A / n + x_B / n

Pie chart

Bar chart

Box plot
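A minimal base R sketch of the table and the three plots, on an invented categorical vector g (and the numeric x from above for the box plot):

g <- c("A", "B", "B", "A", "C", "B")   # toy categorical sample
counts <- table(g)                     # Count column
data.frame(Count      = as.vector(counts),
           Relative   = as.vector(prop.table(counts)),   # count / n
           Cumulative = cumsum(prop.table(counts)),      # running sum of relative freqs
           row.names  = names(counts))
pie(counts)                            # pie chart
barplot(counts)                        # bar chart
boxplot(x)                             # box plot of a numeric variable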

SHAPE
Symmetric bell-shape:
$\text{median} = \text{mean}$, $Q_1 - \text{min} = \text{max} - Q_3$, $\text{median} - Q_1 = Q_3 - \text{median}$, and $\text{median} - Q_1 < Q_1 - \text{minimum}$

Symmetric U-shape:
$\text{median} = \text{mean}$, $Q_1 - \text{min} = \text{max} - Q_3$, $\text{median} - Q_1 = Q_3 - \text{median}$, and $\text{median} - Q_1 > Q_1 - \text{minimum}$

Asymmetric right-skewed:
$\text{median} < \text{mean}$, $Q_1 - \text{min} < \text{max} - Q_3$, $\text{median} - Q_1 < Q_3 - \text{median}$

Asymmetric left-skewed:
$\text{median} > \text{mean}$, $Q_1 - \text{min} > \text{max} - Q_3$, $\text{median} - Q_1 > Q_3 - \text{median}$
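These relations can be checked numerically; a rough sketch in base R (an informal check, not a test), again on the vector x:

q1 <- quantile(x, 0.25); q3 <- quantile(x, 0.75); med <- median(x)
mean(x) - med            # > 0 suggests right skew, < 0 left skew, near 0 symmetry
(q3 - med) - (med - q1)  # > 0: upper half more spread out, consistent with right skew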

CONCENTRATION CURVE
There are two extreme situations of concentration: perfect equality, when all units hold the same amount, and maximum concentration, when one unit holds the whole total and the others hold none.
Take a variable X with observations ranked in ascending order $x_1, \dots, x_n$. The coordinates of the curve are $(F_i, Q_i)$ where
$F_i = \frac{i}{n}$ and $Q_i = \frac{x_1 + x_2 + \dots + x_i}{x_1 + x_2 + \dots + x_n}$
with $F_0 = Q_0 = 0$ and $F_n = Q_n = 1$.
The closer the curve is to the bisector, the closer to perfect equality and the lower the concentration; the closer it is to the x-axis, the closer to maximum concentration and the higher the concentration.

CONCENTRATION INDEXES
The maximum concentration area is: $\frac{n - 1}{2n}$

Gini's index (R): the concentration area divided by the maximum concentration area:
$R = \frac{(F_1 - Q_1) + \dots + (F_{n-1} - Q_{n-1})}{F_1 + F_2 + \dots + F_{n-1}}$
Properties: $0 \le R \le 1$, with R = 0 at perfect equality and R = 1 at maximum concentration.

Pietra's index (P): $P = \frac{n}{n - 1} \max_i (F_i - Q_i)$
Properties: P = 0 at perfect equality and P = 1 at maximum concentration.
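A small base R sketch of the curve coordinates and both indexes, following the formulas above (the five values are invented; the variable must be non-negative, e.g. incomes):

xs <- sort(c(10, 20, 30, 40, 100))       # observations in ascending order
n  <- length(xs)
Fi <- (1:n) / n                          # F_i = i / n
Qi <- cumsum(xs) / sum(xs)               # Q_i = (x_1 + ... + x_i) / (x_1 + ... + x_n)
R  <- sum(Fi[-n] - Qi[-n]) / sum(Fi[-n]) # Gini: concentration area / maximum area
P  <- n / (n - 1) * max(Fi - Qi)         # Pietra's index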

ASSOCIATION

PREMISE
Association is when two variables vary in a systematic way: we look for dependence, where the dependent variable is the one we want to explain and the independent variable is the one used to provide the explanation.

BOTH VARIABLES ARE CATEGORICAL
We use a cross tab where the independent variable splits the sample into sub-samples. We work out the frequency distribution of the dependent variable in each group and compare them; this is known as the conditional relative distribution ($\frac{\text{joint distribution}}{\text{marginal distribution}}$).

There is association if the conditional frequencies are different, and no association if the conditional frequencies are similar. It can be represented by means of a stacked bar plot or a side-by-side bar plot.

DEPENDENT CATEGORICAL AND INDEPENDENT NUMERICAL
We use synthetic measures such as the mean: there is association if the means differ across groups, and no association if the means are similar. We can also compare boxplots, with the same reading (similar boxplots mean no association, different boxplots mean association).

BOTH VARIABLES ARE NUMERICAL
There are three ways of assessing the dependence between two numerical variables (a base R sketch follows this list):
Scatterplot: split into four quadrants around the point $(\bar{x}, \bar{y})$.
• positive linear association: high values of one variable occur with high values of the other
• negative linear association: high values of one variable occur with low values of the other
Covariance: $Cov(X, Y) = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \dots + (x_n - \bar{x})(y_n - \bar{y})}{n - 1}$
• a positive covariance means points lie mostly in quadrants I and III, giving a positive linear association
• a negative covariance means points lie mostly in quadrants II and IV, giving a negative linear association
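A base R sketch of the three checks, all on invented data:

# 1) Both categorical: conditional relative distributions in a cross tab
tab <- table(indep = c("M", "M", "F", "F", "M"),
             dep   = c("yes", "no", "yes", "yes", "no"))
prop.table(tab, margin = 1)   # distribution of dep within each level of indep
barplot(t(tab))               # stacked bar plot

# 2) Mixed case: compare group means and boxplots
y   <- c(3, 5, 8, 9, 4)
grp <- c("a", "a", "b", "b", "a")
tapply(y, grp, mean)          # association if the group means differ
boxplot(y ~ grp)

# 3) Both numerical: scatterplot and covariance
x2 <- c(1, 2, 3, 4, 5)        # invented second numeric variable
plot(x2, y)                   # scatterplot, four quadrants around the means
cov(x2, y)                    # the sign gives the direction of the linear association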

LINEAR TRANSFORMATION
For any random variable X and two numbers a and b, Y is the linear transformation of X such that $Y = a + bX$:
$E(Y) = a + bE(X)$  $Var(Y) = b^2 Var(X)$

STANDARDIZATION
A linear transformation where X becomes Z such that $Z = \frac{X - \mu}{\sigma}$:
$E(Z) = 0$  $Var(Z) = 1$

BERNOULLI DISTRIBUTION
With notation $X \sim Ber(p)$, it is an experiment with 2 outcomes: success (value 1) with probability p, and failure (value 0) with probability 1 − p.
$E(X) = p$  $Var(X) = p(1 - p)$

NORMAL (GAUSSIAN) DISTRIBUTION
With notation $X \sim N(\mu, \sigma^2)$, µ is the median and expected value. The higher µ, the further to the right the curve sits. The higher σ², the flatter and wider the curve. The three sigma rule: there is a probability of
• 0.68 of observing a value in the interval $[\mu - \sigma, \mu + \sigma]$
• 0.95 of observing a value in the interval $[\mu - 2\sigma, \mu + 2\sigma]$
• 0.99 of observing a value in the interval $[\mu - 3\sigma, \mu + 3\sigma]$

STANDARD NORMAL DISTRIBUTION
With notation $Z \sim N(0, 1)$, i.e. µ = 0 and σ² = 1, the three sigma rule says there is a probability of
• 0.68 of observing a value in the interval [−1, 1]
• 0.95 of observing a value in the interval [−2, 2]
• 0.99 of observing a value in the interval [−3, 3]
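The transformation rules and the three sigma rule can be verified by simulation; a sketch in base R (the constants a = 3, b = 2 are arbitrary):

z  <- rnorm(1e5)                # draws from the standard normal N(0, 1)
y2 <- 3 + 2 * z                 # Y = a + bX with a = 3, b = 2
c(mean(y2), var(y2))            # close to a + b*0 = 3 and b^2 * 1 = 4
mean(abs(z) <= 1)               # close to 0.68
mean(abs(z) <= 2)               # close to 0.95
mean(abs(z) <= 3)               # close to 0.997 (the 0.99 above is a rounding)
zs <- as.vector(scale(y2))      # standardization (value - mean) / sd
c(mean(zs), sd(zs))             # approximately 0 and 1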

JOINT PROBABILISTIC BEHAVIOR
Let there be two variables X and Y which take values $x_1, \dots, x_k$ and $y_1, \dots, y_k$ respectively, with joint probabilities $\Pr(X = x_i, Y = y_j)$. We can represent them in a cross tab of probabilities. The two variables are independent if every joint probability equals the product of the marginals: $\Pr(X = x_i, Y = y_j) = \Pr(X = x_i) \cdot \Pr(Y = y_j)$
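A quick check of the independence condition on an invented 2×2 joint probability table: every joint probability must equal the product of its marginals.

joint <- matrix(c(0.12, 0.28, 0.18, 0.42), nrow = 2)  # rows: X values, columns: Y values
px <- rowSums(joint)                                  # marginal distribution of X
py <- colSums(joint)                                  # marginal distribution of Y
all.equal(joint, outer(px, py))                       # TRUE here: X and Y are independent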

We use Pearson's correlation index $\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$, with $-1 \le \rho \le 1$; the closer to 1 or −1, the stronger the linear association. If ρ = 0 there is no linear correlation (independence implies ρ = 0, but ρ = 0 alone does not guarantee independence). If ρ > 0 there is a positive linear association, and if ρ < 0 a negative linear association.

SUM OF RANDOM VARIABLES
If the two variables are correlated:

$E(X + Y) = \mu_X + \mu_Y$  $Var(X + Y) = \sigma_X^2 + \sigma_Y^2 + 2Cov(X, Y)$
If the two variables are independent, $Cov(X, Y) = 0$ and $Var(X + Y) = \sigma_X^2 + \sigma_Y^2$.

ONE-SIDED TEST ON THE MEAN OF A NORMAL POP WITH KNOWN VARIANCE
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with known variance $\sigma^2$. A test with significance level ⍺
of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ rejects H0 if $\bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha$
$\mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$ is also known as the critical value, and $\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > z_\alpha$ is the rejection region (the set of all values for which we reject the hypothesis). The test of $H_0: \mu \le \mu_0$ against the alternative $H_1: \mu > \mu_0$ rejects H0 in the same way.
The test of $H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$ rejects H0 if $\bar{X} < \mu_0 - z_\alpha \frac{\sigma}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} < -z_\alpha$; here $\mu_0 - z_\alpha \frac{\sigma}{\sqrt{n}}$ is the critical value and $\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} < -z_\alpha$ is the rejection region.
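A sketch of this rejection rule in base R, with invented numbers for the sample mean, σ and n (qnorm gives the normal quantile z_⍺):

xbar <- 5.3; mu0 <- 5; sigma <- 1.2; n <- 40; alpha <- 0.05
z_obs  <- (xbar - mu0) / (sigma / sqrt(n))  # test statistic
z_crit <- qnorm(1 - alpha)                  # z_alpha
z_obs > z_crit                              # TRUE -> reject H0 in favor of mu > mu0
mu0 + z_crit * sigma / sqrt(n)              # the critical value on the x-bar scale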

TWO-SIDED TEST ON THE MEAN OF A NORMAL POP WITH KNOWN VARIANCE
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with known variance $\sigma^2$. A test with significance level ⍺ of $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ rejects H0 if
$\bar{X} > \mu_0 + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ or $\bar{X} < \mu_0 - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ or $\frac{|\bar{X} - \mu_0|}{\sigma/\sqrt{n}} > z_{\alpha/2}$

TEST STATISTIC
Test statistic: the function of the sample that we use to decide whether or not to reject the null hypothesis. It summarizes the evidence against the null hypothesis.

TEST ON MEAN OF NORMAL POP WITH KNOWN VARIANCE
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with known variance $\sigma^2$. Any test on $\mu$ is solved based on $\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, which, assuming that $\mu = \mu_0$, has a known standard normal distribution. The three cases are:
$H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$
$H_0: \mu \le \mu_0$ against $H_1: \mu > \mu_0$
$H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$

P-VALUE
The p-value is the probability of observing a value of the test statistic more extreme than the observed one, computed assuming that the null hypothesis is true. "More extreme" is read in the direction of the alternative hypothesis.

P-VALUE IN THE ONE-SIDED TEST
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with known variance $\sigma^2$, testing $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$. The observed value of the test statistic is $\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, and the p-value is
$\Pr\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
computed with the standard normal distribution.

Equivalently: the p-value is the probability, computed assuming that the null hypothesis is true, of observing a value of the test statistic less plausible than the observed one. Decision rule: for a fixed significance level ⍺, we reject the null hypothesis if the p-value is smaller than ⍺.

COMPUTING THE P-VALUE

$H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$: p-value $= \Pr\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$

$H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$: p-value $= \Pr\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} < \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$

$H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$: p-value $= \Pr\left(\frac{|\bar{X} - \mu_0|}{\sigma/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}}\right) = 2\Pr\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$

TESTS ON THE MEAN OF A NORMAL POP WITH UNKNOWN VARIANCE
Any test on the population mean is based on its estimator, the sample mean. When the variance is not known: $(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with both mean and variance unknown.
$E(\bar{X}) = \mu$  $Var(\bar{X}) = \frac{\sigma^2}{n}$  $se(\bar{X}) = \frac{S}{\sqrt{n}}$
Under the null hypothesis, if $\mu = \mu_0$, then $E(\bar{X}) = \mu_0$; therefore $\frac{\bar{X} - \mu_0}{S/\sqrt{n}}$ has a Student's t distribution with (n − 1) degrees of freedom.
For any test on the population mean, the test statistic is now $\frac{\bar{X} - \mu_0}{S/\sqrt{n}}$

The differences from the known-variance case are that the standard error is estimated, no longer known, and the quantile of the Student's t distribution is used instead of the standard normal quantile.

ONE-SIDED TEST ON THE MEAN OF A NORMAL POP WITH UNKNOWN VARIANCE
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with mean and variance unknown. A test with significance level ⍺
of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ rejects H0 if $\bar{X} > \mu_0 + t_{(n-1),\alpha} \frac{S}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > t_{(n-1),\alpha}$
$\mu_0 + t_{(n-1),\alpha} \frac{S}{\sqrt{n}}$ is also known as the critical value, and $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > t_{(n-1),\alpha}$ is the rejection region (the set of all values for which we reject the hypothesis).
p-value: $\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
The probability is computed using the Student's t distribution with (n − 1) degrees of freedom. We reject if the p-value is smaller than ⍺. The test of $H_0: \mu \le \mu_0$ against the alternative $H_1: \mu > \mu_0$ rejects H0 in the same way.
The test of $H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$ rejects H0 if $\bar{X} < \mu_0 - t_{(n-1),\alpha} \frac{S}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} < -t_{(n-1),\alpha}$
p-value: $\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} < \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
The probability is computed using the Student's t distribution with (n − 1) degrees of freedom.

TWO-SIDED TEST ON THE MEAN OF A NORMAL POP WITH UNKNOWN VARIANCE
$(X_1, \dots, X_n)$ i.i.d. $\sim N(\mu, \sigma^2)$ with mean and variance unknown. A test with significance level ⍺ of $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ rejects H0 if
$\bar{X} > \mu_0 + t_{(n-1),\alpha/2} \frac{S}{\sqrt{n}}$ or $\bar{X} < \mu_0 - t_{(n-1),\alpha/2} \frac{S}{\sqrt{n}}$ or $\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}} > t_{(n-1),\alpha/2}$
p-value: $\Pr\left(\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}}\right) = 2\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
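Both t-tests are built into R as t.test(); a sketch on an invented sample, together with the by-hand statistic using pt():

xs2 <- c(5.1, 4.8, 5.6, 5.3, 4.9, 5.4)           # toy sample
t.test(xs2, mu = 5, alternative = "greater")     # one-sided: H0 mu = 5 vs H1 mu > 5
t.test(xs2, mu = 5)                              # two-sided, 95% CI by default

t_obs <- (mean(xs2) - 5) / (sd(xs2) / sqrt(length(xs2)))
1 - pt(t_obs, df = length(xs2) - 1)              # one-sided p-value, t with n - 1 df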

TEST ON THE MEAN WITH AN UNKNOWN POPULATION
Tests on the population mean are based on the corresponding estimator, the sample mean. In the case of an unknown population: $(X_1, \dots, X_n)$ i.i.d. from an unknown population with mean $\mu$ and variance $\sigma^2$, so that
$E(\bar{X}) = \mu$  $Var(\bar{X}) = \frac{\sigma^2}{n}$  $se(\bar{X}) = \frac{S}{\sqrt{n}}$
Under the null hypothesis, if $\mu = \mu_0$, then $E(\bar{X}) = \mu_0$.
For a large n, $\frac{\bar{X} - \mu_0}{S/\sqrt{n}}$ has a distribution that we can approximate by a standard normal distribution. For any test on the population mean, the test statistic is $\frac{\bar{X} - \mu_0}{S/\sqrt{n}}$

This has the same mathematical structure as the test for a normal population with unknown variance, but the quantile of the Student's t distribution is replaced by a quantile of the standard normal distribution.

ONE-SIDED TEST ON THE MEAN OF AN UNKNOWN POP (LARGE SAMPLE)
$(X_1, \dots, X_n)$ i.i.d. from an unknown population, with mean and variance unknown. A test with significance level ⍺
of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ rejects H0 if $\bar{X} > \mu_0 + z_\alpha \frac{S}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > z_\alpha$
p-value: $\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
The probability is computed using the standard normal distribution since, assuming the null hypothesis true, the test statistic has a distribution that can be approximated by a standard normal distribution. The test of $H_0: \mu \le \mu_0$ against the alternative $H_1: \mu > \mu_0$ rejects H0 in the same way.
The test of $H_0: \mu = \mu_0$ against $H_1: \mu < \mu_0$ rejects H0 if $\bar{X} < \mu_0 - z_\alpha \frac{S}{\sqrt{n}}$ or if $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} < -z_\alpha$
p-value: $\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} < \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$
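The large-sample version only swaps the t quantile for a normal one; a sketch reusing the toy sample above (in practice n should be large for the approximation to hold):

n2    <- length(xs2)
z_obs <- (mean(xs2) - 5) / (sd(xs2) / sqrt(n2))  # same statistic, S replaces sigma
1 - pnorm(z_obs)                                 # one-sided p-value via N(0, 1)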

TWO-SIDED TEST ON THE MEAN OF AN UNKNOWN POP (LARGE SAMPLE)
$(X_1, \dots, X_n)$ i.i.d. from an unknown population, with mean and variance unknown. A test with significance level ⍺ of $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ rejects H0 if
$\bar{X} > \mu_0 + z_{\alpha/2} \frac{S}{\sqrt{n}}$ or $\bar{X} < \mu_0 - z_{\alpha/2} \frac{S}{\sqrt{n}}$ or $\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}} > z_{\alpha/2}$
p-value: $\Pr\left(\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}}\right) = 2\Pr\left(\frac{\bar{X} - \mu_0}{S/\sqrt{n}} > \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}}\right)$, assuming $H_0: \mu = \mu_0$

TESTS ON THE PROPORTION
Any test on the population proportion p is based on its estimator, the sample proportion $\hat{P}$. $(X_1, \dots, X_n)$ i.i.d. with Bernoulli distribution of parameter p, with
$E(\hat{P}) = p$  $Var(\hat{P}) = \frac{p(1 - p)}{n}$  $se(\hat{P}) = \sqrt{\frac{p(1 - p)}{n}}$
If $p = p_0$ under the null hypothesis, then the expected value of the sample proportion equals $p_0$, and the test statistic is
$\frac{\hat{P} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$
which for large n is approximately standard normal.

R COMMANDS
For X with a Student's t distribution with n degrees of freedom, cumulative probability Pr(X ≤ x): pt(x, n)
Quantile of a Student's t distribution, Pr(X ≤ q) = 1 − ⍺: qt(1-⍺, n)
Confidence interval on the population mean: t.test(variable, conf.level=1-⍺)
Confidence interval by default (95%): t.test(variable)
(MISSING PART ON CHI SQUARE AND CHAPTER ON LINEAR REGRESSION MODEL)
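Returning to the test on the proportion above, a sketch with invented counts, using the large-n normal approximation:

succ <- 58; n3 <- 100; p0 <- 0.5                 # invented data: 58 successes out of 100
phat  <- succ / n3                               # sample proportion
z_obs <- (phat - p0) / sqrt(p0 * (1 - p0) / n3)  # test statistic under H0: p = p0
1 - pnorm(z_obs)                                 # p-value for H1: p > p0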

