Multivariate Probability and Mathematical Statistics: An Introduction

Nicos Georgiou, University of Sussex (www.math.sussex.ac.uk)

March 26, 2015

Based on notes by Charles Goldie. These lecture notes will be updated regularly during the term.

Contents

1 Prerequisites

2 Vector random variables
  2.1 Joint distribution of random variables
    2.1.1 Discrete random vectors
    2.1.2 The multinomial distribution
    2.1.3 Continuous random vectors
    2.1.4 Uniform random vector
  2.2 Marginals and Moments
  2.3 Transformation of vector random variables
  2.4 Covariance
    2.4.1 Multidimensional moments
    2.4.2 Covariance
    2.4.3 Correlation
    2.4.4 Linear dependence and linear models
    2.4.5 Exercises
  2.5 Special Topics *
    2.5.1 Distribution of sums via m.g.f.
    2.5.2 Exchangeability

3 Conditional Expectation
  3.1 Conditional mass and density functions
    3.1.1 Conditional probability mass/density functions
  3.2 Conditional expectations
  3.3 Special Topics *
    3.3.1 Semi-discrete densities
    3.3.2 Conditioning on events
    3.3.3 Conditional expectation as a random variable
    3.3.4 Sums of random index

4 Statistical Distributions
  4.1 Exponential family of distributions (EFD)
    4.1.1 Exercises
  4.2 Multivariate normal distribution
    4.2.1 Exercises
  4.3 Statistics and common distributions
    4.3.1 The chi-squared distribution
    4.3.2 Student's t-distribution
    4.3.3 Fisher's distribution
    4.3.4 Distributions of basic statistics
    4.3.5 Exercises
  4.4 Special topics *
    4.4.1 Multivariate Central Limit Theorem

5 Parameter estimation
  5.1 Unbiased Estimators (UBE)
    5.1.1 Exercises
  5.2 Maximum Likelihood Estimators (MLE)
  5.3 Sufficiency
    5.3.1 Exercises
  5.4 Completeness
  5.5 Unbiased Minimum Variance Estimators (UMVE)
  5.6 Consistency
  5.7 Bayesian Estimators
  5.8 The German tank problem

6 Hypothesis testing
  6.1 Statistical hypothesis testing and errors
  6.2 Testing for the mean of a normal population
    6.2.1 Test for the mean with known population variance
    6.2.2 Test for the mean with unknown population variance
    6.2.3 Testing for the mean difference of two normal populations
    6.2.4 Exercises
  6.3 Confidence intervals (1)
  6.4 Tests for variances
    6.4.1 Test for the value of a variance of a normal population
    6.4.2 Test for the ratio of variances of two normal populations
  6.5 The p-value of a test
  6.6 Test for proportions
  6.7 Goodness of fit tests
    6.7.1 Contingency tables
  6.8 Exercises

7 Tables

List of Figures

2.1 Left: scatter plot for $(X, V)$. Right: the residuals. The horizontal axis is $X$.
6.1 Critical region (shaded) for $H_1 : \mu > \mu_0$. We reject $H_0$ in favor of $H_1$ when the value $z$ of the $Z$-statistic falls in the shaded region. The area of the shaded region is exactly $\alpha$, the probability of Type I error.
6.2 Critical region (shaded) for $H_1 : \mu < \mu_0$. We reject $H_0$ in favor of $H_1$ when the value $z$ of the $Z$-statistic falls in the shaded region.
6.3 Critical region (shaded) for $H_1 : \mu \neq \mu_0$. We reject $H_0$ in favor of $H_1$ when the value $z$ of the $Z$-statistic falls in the shaded region.
6.4 The cyan shaded region gives the Type II error, while the orange region's area is $\alpha$, the significance level, which equals the Type I error. The centered Gaussian is $Z_1$.
6.5 Critical region (shaded) for $H_1 : \sigma^2 \neq \sigma_0^2$. We reject $H_0$ in favor of $H_1$ when the value $v$ of the $V$-statistic falls in the shaded region.
6.6 Comparison of two Gamma p.d.f.'s with different scaling: the purple one is the p.d.f. of $S_0^2$ and the green one the p.d.f. of $S_1^2$ (smaller scaling parameter).
6.7 Critical region (shaded) for $H_1 : \sigma^2 > \sigma_0^2$. We reject $H_0$ in favor of $H_1$ when the value $v$ of the $V$-statistic falls in the shaded region.
6.8 Critical region (shaded) for $H_1 : \sigma_X^2 \neq \lambda \sigma_Y^2$. We reject $H_0$ in favor of $H_1$ when the value $f$ of the $F$-statistic falls in the shaded region.
6.9 Comparison of the $Q_n$ p.d.f.'s under $H_0$ (purple) and under $H_1$ (green). The green one gives mass to larger values of $Q_n$.

1 Prerequisites

2 Vector random variables

Suppose that we have several random variables $X_1, X_2, \ldots, X_n$ defined on the same probability space $\Omega$. We have already encountered several random variables on the same probability space when we discussed sums of independent random variables $Z = X_1 + X_2$. At that time we were interested in the distribution of $Z$, in particular the density $f_Z(z)$, which can be found from the convolution of the densities $f_{X_1}$ and $f_{X_2}$. By the end of this chapter several different methods will be available to us, usually easier than convolutions, for finding distributions of sums.

From now on we consider the case where we have several random variables $X_1, X_2, \ldots, X_n$ on $\Omega$ that we want to study simultaneously. These random variables can be viewed as the coordinates of a single random variable
$$ X = (X_1, X_2, \ldots, X_n) : \Omega \to \mathbb{R}^n. $$
The random vector $X$ is itself a random variable on $\Omega$ and we want to understand its distribution, i.e. for reasonable sets $B \subseteq \mathbb{R}^n$ we want to understand the probabilities $P\{X \in B\}$. We draw parallels with the single-random-variable analysis. The first step, as in the single-variable case, is to define mass or density functions.
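To make the convolution remark concrete, here is a minimal sketch (not part of the original notes) that computes the p.m.f. of $Z = X_1 + X_2$ by convolving the mass functions of two independent dice; the fair-dice choice is just an illustrative assumption.

```python
from collections import defaultdict
from fractions import Fraction

# p.m.f. of a single fair six-sided die (illustrative assumption)
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(f, g):
    """p.m.f. of X1 + X2 when X1 ~ f and X2 ~ g are independent."""
    h = defaultdict(Fraction)
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] += px * py  # accumulate over all (x, y) with x + y = z
    return dict(h)

f_Z = convolve(die, die)
print(f_Z[7])             # 1/6, the most likely value of the sum
print(sum(f_Z.values()))  # 1, so f_Z is a genuine p.m.f.
```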

2.1 Joint distribution of random variables

2.1.1 Discrete random vectors

Let $X_1, X_2, \ldots, X_n$ be discrete random variables and set $X = (X_1, \ldots, X_n)$. We assume that the $X_i$'s are all defined on the same probability space $\Omega$ and each $X_i$ takes values in some discrete space $S_i \subset \mathbb{R}$. This implies that
$$ X : \Omega \longrightarrow S = \otimes_{i=1}^n S_i \subset \mathbb{R}^n. $$
The symbol $\otimes$ is there to remind us that it might not be necessary for the state space to be written as a product space, though it is always possible to write it like that if we allow events of probability 0.

Example 2.1.1. Let $D_1$ and $D_2$ be the outcomes of two die rolls and let $S = D_1 + D_2$. Consider the random vector $X = (D_1, D_2, S)$. The support of $X$ (i.e. the atoms of strictly positive probability) is the set
$$ S_X = \{(i, j, i + j) : 1 \le i \le 6, \ 1 \le j \le 6\}, $$
which is not a product space; however, it is a subset of the proper product space $S = \{1, \ldots, 6\} \times \{1, \ldots, 6\} \times \{2, \ldots, 12\}$, in which several elements (e.g. $(1, 2, 12)$) have probability 0.

The joint probability mass function of $X_1, \ldots, X_n$ (or, equivalently, the mass function of $X$) is
$$ f_X(x) = f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = P\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\}. \qquad (2.1.1) $$
Since $f_X(x)$ is a mass function, it must be non-negative and the sum over all possible values $x$ must equal 1. In symbols,
$$ \sum_{x \in S} f_X(x) = \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} \cdots \sum_{x_n \in S_n} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = 1. \qquad (2.1.2) $$

As in the single-variable case, we can immediately devise a formula for the distribution of $X$; when evaluated properly it gives us the probability that $X$ takes values in some set $A \subseteq S \subset \mathbb{R}^n$:
$$ P\{X \in A\} = \sum_{x \in A} f_X(x). \qquad (2.1.3) $$
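The two formulas above are easy to check numerically. The following sketch (my own illustration, not part of the notes) builds the joint p.m.f. of $X = (D_1, D_2, S)$ from Example 2.1.1 for two fair dice and uses (2.1.2) and (2.1.3) as sanity checks; the event $A$ chosen at the end is arbitrary.

```python
from fractions import Fraction
from itertools import product

# Joint p.m.f. of X = (D1, D2, S) for two fair dice, as in Example 2.1.1.
# Only the 36 triples (i, j, i + j) carry positive mass; every other point of
# the product space {1,...,6} x {1,...,6} x {2,...,12} has probability 0.
f_X = {(i, j, i + j): Fraction(1, 36) for i, j in product(range(1, 7), repeat=2)}

# Normalization (2.1.2): the mass sums to 1 over the support.
assert sum(f_X.values()) == 1

# Formula (2.1.3) with the (arbitrary) event A = {the sum S is at least 10}.
A = {x for x in f_X if x[2] >= 10}
print(sum(f_X[x] for x in A))  # 1/6
```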

To demonstrate the two basic ideas above (and to connect with more familiar one-dimensional concepts) we begin with a two-dimensional example.

Example 2.1.2 (Two dice, part 1). Consider two distinct six-sided dice. One is fair, but the other has been modified to favor the number 4 with probability 7/12, while its remaining sides have equal probability. We roll both dice once and record their entries as a pair $X = (D_1, D_2)$, where $D_i$ is the number of pips rolled by die $i$. Define $M = \max\{D_1, D_2\}$ and $Z = \min\{D_1, D_2\}$.

1. Find an appropriate state space $S$ for $X$. Then compute the joint mass function $f_X(x)$.
2. Find an appropriate state space $S$ for the pair $(M, Z)$. Then compute the joint mass function $f_{M,Z}(m, z)$.

Solution: 1. The r.v. $X$ is a two-dimensional discrete random variable, so the state space will be the Cartesian product of the state spaces of the coordinates $D_1$ and $D_2$. Since both $D_i$ are die entries, their respective state spaces are $S_i = \{1, \ldots, 6\}$. Then
$$ S = \{(i, j) : 1 \le i \le 6, \ 1 \le j \le 6\} = S_1 \times S_2. \qquad (2.1.4) $$

The joint mass function is found by computing the probabilities $P\{X = (i, j)\}$ for all $i, j$. Note that $\mathrm{card}(S) = 36$ and it is quite cumbersome to write down 36 equations. For this reason we organize the data in a table (rows indexed by $D_1$, columns by $D_2$):

          D2 = 1     D2 = 2     D2 = 3     D2 = 4     D2 = 5     D2 = 6
D1 = 1    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)
D1 = 2    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)
D1 = 3    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)
D1 = 4    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)
D1 = 5    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)
D1 = 6    1/(6·12)   1/(6·12)   1/(6·12)   7/(6·12)   1/(6·12)   1/(6·12)

Mass function for X.

Question: How are the entries computed?
Answer: To compute each entry of the table we used the already known notion of independence: define the events $A_i = \{D_1 = i\}$ and $B_j = \{D_2 = j\}$. Then

$$ P\{X = (i, j)\} = P\{D_1 = i, D_2 = j\} = P\{A_i, B_j\} \overset{(!)}{=} P\{A_i\}\,P\{B_j\}. $$

2. The pair $(M, Z)$ still has the same state space $S$ as in (2.1.4) (why?). Computing the entries of the table now becomes slightly more challenging, as $M$ and $Z$ are no longer independent. To see this last statement on a heuristic level, consider the case where $M = 1$. Since $M$ is the maximum of the two die rolls, it must be that the minimum $Z$ also equals 1. Since information about $M$ gives (a lot of) information about $Z$, independence is lost.

We indicatively compute some entries of the mass function and leave the rest to the reader. First of all, note that
$$ P\{M = m, Z = z\} = 0 \qquad \text{for all } m < z. $$
This is because the maximum of the die rolls cannot be strictly smaller than the minimum. For the remaining cases ($m \ge z$), we need to interpret the event $\{M = m, Z = z\}$ in terms of events involving $D_1$ and $D_2$ whose probabilities are easy to compute. For example,
$$ P\{M = 4, Z = 2\} = P\bigl\{(D_1, D_2) \in \{(2, 4), (4, 2)\}\bigr\} \overset{(!)}{=} P\{(D_1, D_2) = (2, 4)\} + P\{(D_1, D_2) = (4, 2)\} = f_X(2, 4) + f_X(4, 2) = 1/9. $$

Here we already saw an instance of independence between the two coordinates of a two-dimensional random vector. The notion generalizes to all dimensions: for $X = (X_1, X_2, \ldots, X_n)$ we can decide whether the coordinates are independent by looking at the mass functions, as given by the following definition.

Definition 2.1.3 (Independence for discrete random variables). Let $X = (X_1, X_2, \ldots, X_n)$ be a discrete random vector on a product space $S = \times_{i=1}^n S_i$. The coordinates $X_1, \ldots, X_n$, with corresponding mass functions $f_{X_i}(x_i)$ and supports $S_i$, are mutually independent random variables if, for any $1 \le k \le n$, the joint mass function of a vector formed by $k$ of them, $X^{(k)} = (X_{i_1}, \ldots, X_{i_k})$, can be written as
$$ f_{X^{(k)}}(x_{i_1}, \ldots, x_{i_k}) = f_{X_{i_1}}(x_{i_1})\, f_{X_{i_2}}(x_{i_2}) \cdots f_{X_{i_k}}(x_{i_k}) \qquad \text{for all } (x_{i_1}, \ldots, x_{i_k}) \in S_{i_1} \times \cdots \times S_{i_k}. $$
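As an illustration of Definition 2.1.3 (a sketch of mine, not from the notes), the code below builds the joint mass function of $(D_1, D_2)$ from Example 2.1.2, assuming it is the second die that favors 4, derives the joint mass function of $(M, Z)$, and tests the factorization: it holds for $(D_1, D_2)$ but fails for $(M, Z)$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Example 2.1.2: D1 is fair; D2 is assumed to be the die that favors 4
# with probability 7/12 (its other faces each have probability 1/12).
p1 = {i: Fraction(1, 6) for i in range(1, 7)}
p2 = {j: (Fraction(7, 12) if j == 4 else Fraction(1, 12)) for j in range(1, 7)}

# Joint p.m.f. of (D1, D2): the two rolls are independent.
f_D = {(i, j): p1[i] * p2[j] for i, j in product(range(1, 7), repeat=2)}

# Joint p.m.f. of (M, Z) = (max, min), obtained by pushing the mass forward.
f_MZ = defaultdict(Fraction)
for (i, j), p in f_D.items():
    f_MZ[max(i, j), min(i, j)] += p

def marginals(joint):
    """Marginal p.m.f.'s of the two coordinates of a 2-d joint p.m.f."""
    m1, m2 = defaultdict(Fraction), defaultdict(Fraction)
    for (a, b), p in joint.items():
        m1[a] += p
        m2[b] += p
    return m1, m2

def is_independent(joint):
    """Check the factorization of Definition 2.1.3 for a 2-d random vector."""
    m1, m2 = marginals(joint)
    return all(joint.get((a, b), Fraction(0)) == m1[a] * m2[b]
               for a, b in product(m1, m2))

print(is_independent(f_D))   # True: the two rolls are independent
print(is_independent(f_MZ))  # False: max and min are dependent
print(f_MZ[4, 2])            # 1/9, as computed in the example
```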

2.1.2 The multinomial distribution

Consider an experiment of extracting $n$ balls from a (gigantic and full) box that contains balls of $r$ different colors (categories). After each extraction the ball is replaced in the box, and independent trials of this experiment are repeated until $n$ balls have been drawn and their colors (but not the order in which they are drawn) recorded. Thus a ball of color $i$ always has the same probability $p_i$ of being drawn, and it must be the case that $\sum_{i=1}^r p_i = 1$ (why?). Let
$$ X_i = \text{number of drawn balls of color } i, \qquad 1 \le i \le r. $$
We are interested in the random vector $X = (X_1, \ldots, X_r) \in \{0, 1, \ldots, n\}^r$. First note that $\sum_{i=1}^r X_i = n$ (why?), so already there is some dependency between the coordinates. We will compute its p.m.f.
$$ f_X(x_1, x_2, \ldots, x_r) = P\{X_1 = x_1, X_2 = x_2, \ldots, X_r = x_r\}, $$
and observe that if $\sum_{i=1}^r x_i \neq n$ then $f_X(x_1, x_2, \ldots, x_r) = 0$. Thus we compute the p.m.f. under the constraint that $\sum_{i=1}^r x_i = n$.

Consider for a moment that we just want to compute the probability of a given ordered arrangement $Y = (y_1, y_2, \ldots, y_n)$, with each $y_j$ belonging to one of the $r$ categories and such that there are $x_i$-many $y_j$'s from category $i$. Then, since the draws are independent by assumption, the probability of drawing an element of category $i$ at the $j$-th spot is exactly $p_i$ (independently of $j$). Thus
$$ P\{Y = (y_1, y_2, \ldots, y_n)\} = p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}. $$
This probability does not depend on the ordering; it only depends on how many elements we have from each category. Computing the p.m.f. of $X$ is now a combinatorial problem: we are given $n$ objects, each one belonging to exactly one of $r$ categories, with $x_i$-many objects from category $i$. To compute the probability we must count how many (ordered) arrangements of these objects exist and add the probabilities of all the arrangements. The number of arrangements is not difficult to compute. There are $n!$ ways to arrange all objects in a row, but we have over-counted, since permutations among elements of the same category give the same global arrangement. Compensating for that, we have
$$ \binom{n}{x_1, x_2, \ldots, x_r} = \frac{n!}{x_1!\, x_2! \cdots x_r!} \qquad (2.1.5) $$
ordered arrangements of $n$ objects with $x_i$-many of them in category $i$, $1 \le i \le r$. The symbol defined in equation (2.1.5) is called the multinomial coefficient (to see why, look at the exercises). Now the p.m.f. of $X$ is
$$ f_X(x_1, x_2, \ldots, x_r) = \begin{cases} \dbinom{n}{x_1, x_2, \ldots, x_r}\, p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}, & \text{if } \sum_{i=1}^r x_i = n, \\[1ex] 0, & \text{otherwise.} \end{cases} \qquad (2.1.6) $$
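Formula (2.1.6) is easy to check numerically. The sketch below (mine, for illustration; the color probabilities are arbitrary) evaluates the p.m.f. directly from the multinomial coefficient (2.1.5) and compares it with a Monte Carlo simulation of the ball-drawing experiment.

```python
import random
from collections import Counter
from math import factorial, prod

def multinomial_pmf(x, p):
    """Formula (2.1.6) for n = sum(x) trials and counts x = (x1, ..., xr)."""
    n = sum(x)
    coeff = factorial(n) // prod(factorial(k) for k in x)  # coefficient (2.1.5)
    return coeff * prod(pi ** xi for pi, xi in zip(p, x))

# Three colors with arbitrary illustrative probabilities and n = 10 draws.
p = [0.5, 0.3, 0.2]
x = (5, 3, 2)
exact = multinomial_pmf(x, p)

# Monte Carlo estimate: repeat the drawing experiment many times and count
# how often the color counts equal x (random.choices draws with replacement).
random.seed(0)
trials = 200_000
hits = 0
for _ in range(trials):
    counts = Counter(random.choices(range(3), weights=p, k=10))
    if tuple(counts.get(i, 0) for i in range(3)) == x:
        hits += 1

print(exact)          # approximately 0.0850
print(hits / trials)  # should be close to the exact value
```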

Definition 2.1.4 (Multinomial distribution). We say a random vector $X$ follows the multinomial distribution with $n$ independent trials and $r$ categories/classes/types of outcomes, $X \sim \mathrm{Mult}(n, r, p_1, \ldots, p_r)$, if and only if it has p.m.f. given by (2.1.6), with the parameters satisfying $n, r \in \mathbb{N}$ and $\sum_{i=1}^r p_i = 1$.

We discuss a subtle point that might arise, especially when reading various sources on the multinomial distribution. If we have $r$ types of possible outcomes in the experiment and $n$ trials, it is not necessary to study the vector $X = (X_1, \ldots, X_{r-1}, X_r)$ under the constraint $\sum_{i=1}^r X_i = n$. Instead, note that $X_r$, $p_r$ (or any $X_i$, $p_i$ for that matter) can be written as
$$ X_r = n - X_1 - \cdots - X_{r-1}, \qquad p_r = 1 - p_1 - \cdots - p_{r-1}, $$
and it suffices to study the $(r-1)$-dimensional vector $X^{(r-1)} = (X_1, \ldots, X_{r-1})$ under the constraint $\sum_{i=1}^{r-1} X_i \le n$. The p.m.f. then becomes
$$ f_{X^{(r-1)}}(x_1, \ldots, x_{r-1}) = \begin{cases} \dbinom{n}{x_1, \ldots, x_{r-1},\, n - \sum_{i=1}^{r-1} x_i}\, p_1^{x_1} \cdots p_{r-1}^{x_{r-1}} \Bigl(1 - \sum_{i=1}^{r-1} p_i\Bigr)^{n - \sum_{i=1}^{r-1} x_i}, & \text{if } \sum_{i=1}^{r-1} x_i \le n, \\[1ex] 0, & \text{otherwise.} \end{cases} \qquad (2.1.7) $$
This should not be surprising: the $r$-dimensional simplex $\{x : \sum_{i=1}^r x_i = n\}$ is actually an $(r-1)$-dimensional space, so we can drop one dimension without any loss of information. To further illustrate the point, consider the following example.

Example 2.1.5. Let $X \sim \mathrm{Bin}(n, p)$. Then $(X, n - X) \sim \mathrm{Mult}(n, 2; p, 1 - p)$.

Solution: Let $Y = n - X$; we will use the substitution $y = n - x$ when necessary. This substitution implies that $x + y = n$, otherwise we cannot carry out the computation below. We write
$$ f_X(x) = \binom{n}{x} p^x (1 - p)^{n-x} = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n-x} = \frac{n!}{x!\, y!}\, p^x (1 - p)^y, \qquad x + y = n, $$
which is exactly $f_{X,Y}(x, y) = f_{X, n-X}(x, y)$.

This fact is easy to explain without appeal to formulas. In the case of the binomial, we have $n$ trials marked by success or failure and we are interested in the number of successes (which incidentally gives the number of failures as well). Thus there are two possible outcomes for each trial, with complementary probabilities, and therefore it must also be possible to model the experiment with a multinomial distribution.

To avoid confusion with the notation, and since it will always be obvious from the context which of the two forms of the p.m.f. ((2.1.6) or (2.1.7)) we are using, we still say that $X^{(r-1)} \sim \mathrm{Mult}(n, r, p_1, \ldots, p_{r-1}, 1 - \sum_{i=1}^{r-1} p_i)$, i.e. the last coordinate $n - \sum_{i=1}^{r-1} X_i$ is always implied.
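A quick numerical check of Example 2.1.5 (again only an illustrative sketch, with arbitrary parameters): the binomial p.m.f. of $X$ agrees with the two-category multinomial p.m.f. of $(X, n - X)$.

```python
from math import comb, factorial

n, p = 10, 0.35  # arbitrary illustrative parameters

for x in range(n + 1):
    y = n - x
    binom = comb(n, x) * p**x * (1 - p)**(n - x)                  # Bin(n, p) p.m.f.
    mult = factorial(n) / (factorial(x) * factorial(y)) * p**x * (1 - p)**y
    assert abs(binom - mult) < 1e-12                              # Mult(n, 2; p, 1-p)
print("binomial p.m.f. == two-category multinomial p.m.f. for every x")
```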

2.1.3 Continuous random vectors

Let $X_1, X_2, \ldots, X_n$ be continuous random variables and set $X = (X_1, \ldots, X_n)$. As in the discrete case, assume that the $X_i$'s are all defined on the same probability space $\Omega$ and each $X_i$ takes values in some continuous space $S_i \subset \mathbb{R}$, so that
$$ X : \Omega \longrightarrow S = \otimes_{i=1}^n S_i \subset \mathbb{R}^n. $$
The object that substitutes the notion of the joint mass function is the joint probabili...

