
COLUMBIA UNIVERSITY
Nonparametric Statistics: Probability review

Prof. Marco Avella, Spring 2020

Calculating the significance level and the probability of Type II error of most of the tests that we see in this course requires nontrivial calculations. One of the main tools that will help us in such cases is asymptotic performance analysis. With very little computational power, such analysis can usually provide sufficiently accurate results, especially when the number of samples is large. In this lecture, we briefly summarize almost sure convergence, convergence in distribution, and convergence in probability. We will also summarize some of the results that will be used throughout this course. We will see many applications of these results later in the course.

1 Convergence in distribution

Let X1, X2, . . . , Xn, . . . be a sequence of random variables with Xn ∼ Fn. Also, let X ∼ F. The sequence {Xn}_{n=1}^∞ is said to converge to X in distribution, denoted by Xn →d X, if and only if Fn(x) → F(x) as n → ∞ at every continuity point x of F.

Remark. Note that it is usually easier to prove that the probability density function of Xn converges to the probability density function of X. It turns out that if the pdf of Xn converges to the pdf of X, then we can conclude that Xn →d X. This is a simple corollary of Scheffé's lemma. I do not expect you to know Scheffé's lemma; just remember that if you prove convergence of the pdf's, you can conclude convergence in distribution. The same is true for probability mass functions, i.e., if you prove convergence of the probability mass functions, then you again have convergence in distribution.

Example 1. Let Xn ∼ N(1/n, 1). What is the limiting distribution of {Xn}_{n=1}^∞? It is straightforward to confirm that the pdf of Xn converges to the pdf of N(0, 1). Therefore, N(0, 1) is the limiting distribution.

A well-known example of convergence in distribution is the Central Limit Theorem (CLT).

Theorem 1. Let X1, X2, . . . be i.i.d. ∼ F, with E(Xi) = µ < ∞ and var(Xi) = σ² < ∞. Define the sample average as X̄n = (1/n) Σ_{i=1}^n Xi. Then

√n(X̄n − µ) →d N(0, σ²).
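As a quick sanity check of Theorem 1 (not part of the original notes), the short NumPy simulation below draws repeated samples and looks at √n(X̄n − µ). The Exponential(1) population, the sample size n = 500, and the number of repetitions are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (assumptions): Exponential(1) data, so mu = 1 and sigma^2 = 1.
n, reps = 500, 20_000
mu, sigma = 1.0, 1.0

# Draw `reps` independent samples of size n and form sqrt(n) * (sample mean - mu).
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu)

# If the CLT holds, z should look approximately N(0, sigma^2).
print(z.mean(), z.std())   # expect values close to 0 and sigma = 1
```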

Example 2. Let Xn be a χ²_n random variable. What is the limiting distribution of Yn = (1/√n)(Xn − n)?

Let Z1, Z2, . . . , Zn, . . . be i.i.d. N(0, 1). Since Xn is χ²_n, its distribution is the same as the distribution of Σ_{i=1}^n Zi², which is a sum of the i.i.d. random variables Zi². Note that E(Zi²) = 1. Also,

var(Zi²) = E(Zi² − 1)² =(a) 2.

To obtain Equality (a), we have used the fact that E(Zi⁴) = 3. Can you prove it? Now it is clear from the CLT that √n((1/n)Xn − 1) →d N(0, 2). It is also straightforward to confirm that if Yn = (1/√n)(Xn − n), then Yn →d N(0, 2).
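The same kind of check (again my own illustration, not from the notes; n = 1000 is an arbitrary "large" value) can be run for Example 2: simulated values of Yn = (Xn − n)/√n should have variance close to 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative check (assumption: n = 1000 is large enough for the asymptotics to be visible).
n, reps = 1000, 20_000
xn = rng.chisquare(df=n, size=reps)     # Xn ~ chi-squared with n degrees of freedom
yn = (xn - n) / np.sqrt(n)

# Example 2 predicts Yn is approximately N(0, 2).
print(yn.mean(), yn.var())   # expect values close to 0 and 2
```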

2 Convergence in probability

Consider a sequence of random variables {Xn}_{n=1}^∞. The sequence is said to converge to X in probability, denoted by Xn →p X, if and only if for every ε > 0,

lim_{n→∞} P(|Xn − X| < ε) = 1.

Example 3. Consider Xn ∼ Exponential(λn), i.e., the pdf of Xn is f_{Xn}(x) = λn e^{−λnx} I(x ≥ 0), where λ > 0 is fixed and I is the indicator function. Prove that Xn →p 0 as n → ∞.

Proof. In order to prove this convergence we should calculate P(|Xn| > ε). It is straightforward to confirm that P(|Xn| > ε) = e^{−λnε} → 0 as n → ∞. Hence Xn →p 0.
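Example 3 is easy to verify numerically. The sketch below (my own illustration; λ = 2, ε = 0.1, and the sample sizes are arbitrary) compares the empirical tail probability with the exact value e^{−λnε}.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (assumptions): lam = 2.0, eps = 0.1.
lam, eps, reps = 2.0, 0.1, 100_000

for n in (1, 5, 10, 50):
    # NumPy parameterizes the exponential by its scale, i.e. 1 / rate = 1 / (lam * n).
    xn = rng.exponential(scale=1.0 / (lam * n), size=reps)
    # Empirical P(|Xn| > eps) versus the exact value exp(-lam * n * eps) from Example 3.
    print(n, (np.abs(xn) > eps).mean(), np.exp(-lam * n * eps))
```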

A well-known example of convergence in probability, which will be used often in this course, is the weak law of large numbers (WLLN).

Theorem 2 (WLLN). Let X1, X2, . . . , Xn, . . . be i.i.d. with E(Xi) = µ < ∞ and var(Xi) = σ² < ∞. Then X̄n →p µ.

The proof of this statement is straightforward, and I encourage interested readers to work it out for themselves. The only inequality you may need to prove this result is Markov's inequality.
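A one-line sanity check of the WLLN (again my own illustration; Uniform(0, 1) data, so µ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(3)

# Sample means of Uniform(0, 1) data should concentrate around mu = 0.5 as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(n, rng.uniform(size=n).mean())
```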

3 Almost sure convergence (optional)

The last type of convergence we discuss here is almost sure convergence. The sequence {Xn}_{n=1}^∞ is said to converge to X almost surely, denoted by Xn →a.s. X, if and only if

P(ω : lim_{n→∞} Xn(ω) = X(ω)) = 1.

Almost sure convergence is also known as convergence with probability one. We do not use almost sure convergence in this course. A well-known example of almost sure convergence is the strong law of large numbers (SLLN).

Theorem 3 (SLLN). Let X1, X2, . . . , Xn, . . . be i.i.d. with E(Xi) = µ < ∞ and var(Xi) = σ² < ∞. Then X̄n →a.s. µ.

Note that among the three types of convergence we have discussed so far, almost sure convergence is the strongest. In fact, if Xn →a.s. X, then Xn converges to X in both probability and distribution. We discuss some parts of this statement in the next section and leave the rest as an exercise for you.

4 Connection between convergence in probability and convergence in distribution

Here we summarize a few theorems from the course on probability theory that will be useful throughout this course. In Homework ?? you will play with these results.

Theorem 4. If Xn →p X, then Xn →d X.

Remark (optional reading). As we mentioned before, almost sure convergence implies convergence in probability. According to Theorem 4, convergence in probability also implies convergence in distribution. Therefore, we can also conclude that almost sure convergence implies convergence in distribution.

Proof. We would like to prove that lim_{n→∞} P(Xn ≤ a) = P(X ≤ a). Let ε > 0 be a fixed number. Here we employ a very useful trick known as conditioning. We have

P(Xn ≤ a) = P(Xn ≤ a | |Xn − X| < ε) P(|Xn − X| < ε) + P(Xn ≤ a | |Xn − X| ≥ ε) P(|Xn − X| ≥ ε).   (1)

We now use (1) to find an upper bound and a lower bound for P(Xn ≤ a). Let us start with the upper bound:

P(Xn ≤ a) ≤(b) P(Xn ≤ a | |Xn − X| < ε) P(|Xn − X| < ε) + P(|Xn − X| ≥ ε)
          ≤(c) P(X ≤ a + ε | |Xn − X| < ε) P(|Xn − X| < ε) + P(|Xn − X| ≥ ε)
          = P(X ≤ a + ε, |Xn − X| < ε) + P(|Xn − X| ≥ ε)
          ≤(d) P(X ≤ a + ε) + P(|Xn − X| ≥ ε).   (2)

Inequality (b) is the result of (1) and the fact that P(Xn ≤ a | |Xn − X| ≥ ε) ≤ 1. To obtain Inequality (c) we use the following argument: conditioned on |Xn − X| < ε we conclude that X − ε < Xn < X + ε; furthermore, if we replace Xn with something smaller, the value of P(Xn ≤ a | |Xn − X| < ε) increases, so we replace Xn with X − ε. Finally, Inequality (d) is due to the fact that the probability of the intersection of two events is less than or equal to the probability of each of the events.

To obtain a lower bound we again use (1). Since P(Xn ≤ a | |Xn − X| ≥ ε) P(|Xn − X| ≥ ε) ≥ 0, we have

P(Xn ≤ a) ≥ P(Xn ≤ a | |Xn − X| < ε) P(|Xn − X| < ε)
          ≥(e) P(X ≤ a − ε | |Xn − X| < ε) P(|Xn − X| < ε)
          =(f) P(X ≤ a − ε) − P(X ≤ a − ε | |Xn − X| ≥ ε) P(|Xn − X| ≥ ε)
          ≥(g) P(X ≤ a − ε) − P(|Xn − X| ≥ ε).   (3)

We obtain Inequality (e) by the following argument: |Xn − X| < ε implies that Xn ≤ X + ε; also, if we replace Xn with something larger, P(Xn ≤ a | |Xn − X| < ε) decreases, so we replace Xn with X + ε and obtain Inequality (e). Equality (f) is due to the fact that

P(X ≤ a − ε) = P(X ≤ a − ε | |Xn − X| ≥ ε) P(|Xn − X| ≥ ε) + P(X ≤ a − ε | |Xn − X| < ε) P(|Xn − X| < ε).

Finally, Inequality (g) follows from P(X ≤ a − ε | |Xn − X| ≥ ε) ≤ 1. Combining (2) and (3) we obtain

P(X ≤ a − ε) − P(|Xn − X| ≥ ε) ≤ P(Xn ≤ a) ≤ P(X ≤ a + ε) + P(|Xn − X| ≥ ε).

Letting n → ∞ (and using Xn →p X) implies that for every ε > 0,

P(X ≤ a − ε) ≤ lim_{n→∞} P(Xn ≤ a) ≤ P(X ≤ a + ε).   (4)

If a is a continuity point of F, then letting ε → 0 establishes the result (by the sandwich theorem for limits). For those of you who have seen lim sup and lim inf, note that (4) is slightly sloppy; you may want to make it more precise. Those of you who are not familiar with these topics, or have forgotten them, need not worry about this.

While convergence in probability implies convergence in distribution, the converse is not true; i.e., convergence in distribution does not imply convergence in probability. Here is an example.

Example 4. Let X ∼ Unif(0, 1). We define {Xn}_{n=1}^∞ with Xn = X. First note that Xn →d X. Why? Since X has the same distribution as 1 − X, we can also conclude that Xn →d 1 − X. However, I claim that Xn does not converge to 1 − X in probability, because

P(|Xn − (1 − X)| ≥ ε) = P(|2X − 1| ≥ ε) = 1 − ε,

where ε is assumed to be less than 0.5. Make sure to calculate this probability yourself and prove that P(|2X − 1| ≥ ε) = 1 − ε. As this example shows, convergence in distribution does not imply convergence in probability.
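A quick numerical check of the claimed probability (my own illustration, not in the notes; ε = 0.3 is an arbitrary choice below 0.5):

```python
import numpy as np

rng = np.random.default_rng(4)

# Here Xn = X ~ Unif(0, 1) for every n, and Example 4 claims
# P(|Xn - (1 - X)| >= eps) = P(|2X - 1| >= eps) = 1 - eps.
eps, reps = 0.3, 1_000_000
x = rng.uniform(size=reps)

print((np.abs(2.0 * x - 1.0) >= eps).mean(), 1.0 - eps)   # both roughly 0.7
```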

That said, there is a very special but important partial converse to Theorem 4, which we state in Theorem 5.

Theorem 5. Let c ∈ R be a fixed (nonrandom) number. If Xn →d c, then Xn →p c.

Proof. Let F(x) be defined by F(x) := I(x ≥ c), the CDF of the constant c. For any ε > 0,

P(|Xn − c| ≤ ε) ≥ P(c − ε < Xn ≤ c + ε) = F_{Xn}(c + ε) − F_{Xn}(c − ε).   (5)

Clearly F_{Xn} converges to F, and the only discontinuity point of F occurs at x = c. Therefore,

lim_{n→∞} F_{Xn}(c + ε) = F(c + ε) = 1,
lim_{n→∞} F_{Xn}(c − ε) = F(c − ε) = 0.   (6)

Combining (5) with (6), we obtain

lim_{n→∞} P(|Xn − c| ≤ ε) ≥ 1.

But we also know that P(|Xn − c| ≤ ε) ≤ 1. Therefore, lim_{n→∞} P(|Xn − c| ≤ ε) = 1, which completes the proof.

Note that this proof is not fully rigorous. I encourage those of you interested in probability theory to find the places where it is not rigorous and fix them.

5 Useful results

5.1 Motivation

So far, we have studied two different types of convergence, convergence in probability and convergence in distribution, and discussed their connections in detail. We have also studied two important instances of these convergences, the weak law of large numbers and the central limit theorem. To obtain the limiting distributions of different estimators and different tests, we usually need to combine different convergence results. Here is an example to clarify this statement:

Example 5. Let X1, X2, . . . , Xn be i.i.d. ∼ F, with mean µ and variance σ², where both µ and σ² are unknown. We estimate the variance by

σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)² = (1/n) Σ_{i=1}^n Xi² − X̄²,   (7)

where X̄ = (1/n) Σ_{i=1}^n Xi. Can you prove the last equality? The question is what σ̂² converges to in probability. If we look at the right-hand side of (7), we notice that the term (1/n) Σ_{i=1}^n Xi² converges in probability to E(Xi²). Why? Furthermore, X̄ → E(X) in probability. Now the question is: can we employ these individual results to obtain a new result regarding σ̂²? In other words, can we say that σ̂² →p E(X²) − (E(X))²? Furthermore, can we employ the CLT, for instance, to characterize the distribution of √n(σ̂² − σ²)? In the next few sections, we introduce some tools that enable you to do these calculations.
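The sketch below (my own illustration; the Gamma(2, 1) population is an arbitrary choice with known variance 2) checks both the algebraic identity in (7) and the consistency of σ̂² that the following sections will justify formally.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data (assumption): Gamma(shape=2, scale=1), so E(X) = 2 and var(X) = 2.
n = 100_000
x = rng.gamma(shape=2.0, scale=1.0, size=n)

# The two forms of the estimator in (7) agree up to floating-point error.
sigma2_hat_a = np.mean((x - x.mean()) ** 2)
sigma2_hat_b = np.mean(x ** 2) - x.mean() ** 2
print(sigma2_hat_a, sigma2_hat_b)

# For large n, both should be close to the true variance, here 2.
print(abs(sigma2_hat_a - 2.0))
```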

5.2 Slutsky's theorem

The first question we would like to answer is the following: suppose that we have two sequences of random variables with Xn →d X and Yn →d Y. Can we characterize the limiting distribution of Xn + Yn? The answer to this question is no. This is due to the fact that the marginal distributions of Xn and Yn do not contain all the information about their joint behavior. For instance, suppose that Xn →d N(0, 1), and define Yn = −Xn. It is clear that Yn →d N(0, 1) as well (can you prove this claim?). However, Xn + Yn →d 0. Now if we define a new sequence of random variables Wn = Xn, then Xn + Wn = 2Xn →d N(0, 4). Why? Therefore, as long as we do not have a good understanding of the joint distribution of (Xn, Yn), we cannot derive the limiting distribution of Xn + Yn. Slutsky's theorem is a special case for which we can characterize the limiting distribution of Xn + Yn from the limiting distribution of Xn and the limiting distribution of Yn alone.
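A tiny simulation (my own illustration, using exact N(0, 1) draws to stand in for Xn) makes this concrete: Yn = −Xn and Wn = Xn have the same marginal distribution, yet the two sums behave completely differently.

```python
import numpy as np

rng = np.random.default_rng(6)

# Both Yn = -Xn and Wn = Xn are marginally standard normal...
xn = rng.standard_normal(100_000)
yn = -xn
wn = xn

# ...but Xn + Yn is identically 0, while Xn + Wn = 2 Xn has variance 4.
print((xn + yn).var(), (xn + wn).var())   # expect roughly 0 and 4
```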

Theorem 6 (Slutsky's theorem). Let Xn →d X and Yn →d c, where X is a random variable but c ∈ R is a fixed number. Then,

1. Xn + Yn →d X + c,

2. Xn Yn →d cX,

3. If c ≠ 0, then Xn / Yn →d X / c.

Proof. We only prove part (1) here. First note that Yn →d c implies Yn →p c. Keeping that in mind, we use the conditioning argument we discussed before and provide an upper bound and a lower bound for P(Xn + Yn ≤ t):

P(Xn + Yn ≤ t)
  =(a) P(Xn + Yn ≤ t | |Yn − c| ≤ ε) P(|Yn − c| ≤ ε) + P(Xn + Yn ≤ t | |Yn − c| > ε) P(|Yn − c| > ε)
  ≤(b) P(Xn + Yn ≤ t | |Yn − c| ≤ ε) P(|Yn − c| ≤ ε) + P(|Yn − c| > ε)
  ≤(c) P(Xn ≤ t − c + ε | |Yn − c| ≤ ε) P(|Yn − c| ≤ ε) + P(|Yn − c| > ε)
  ≤(d) P(Xn ≤ t − c + ε) + P(|Yn − c| > ε).   (8)

Furthermore,

P(Xn + Yn ≤ t) ≥ P(Xn ≤ t − c − ε) − P(|Yn − c| ≥ ε).   (9)

If we combine (8) and (9) and let n → ∞, we obtain

P(X ≤ t − c − ε) ≤ lim_{n→∞} P(Xn + Yn ≤ t) ≤ P(X ≤ t − c + ε)   for all ε > 0.   (10)

If we let ε → 0, we obtain the result.
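The simulation below (my own illustration; Exponential(1) data and n = 1000 are arbitrary choices) shows Slutsky's theorem in the form it is used most often in this course: by the CLT, √n(X̄n − µ) →d N(0, σ²), and since σ̂n →p σ, part 3 of the theorem gives √n(X̄n − µ)/σ̂n →d N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative choices (assumptions): Exponential(1) data, so mu = sigma = 1.
n, reps = 1000, 10_000
mu = 1.0

x = rng.exponential(scale=1.0, size=(reps, n))
xbar = x.mean(axis=1)
sigma_hat = x.std(axis=1, ddof=1)          # consistent estimator of sigma

# Slutsky (part 3): the studentized mean should be approximately N(0, 1).
t = np.sqrt(n) * (xbar - mu) / sigma_hat
print(t.mean(), t.std())                   # expect values close to 0 and 1
```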

As we discussed before Slutsky's theorem, Xn →d X and Yn →d Y do not imply anything about the distribution of Xn + Yn. The next theorem says that Xn →p X and Yn →p Y do in fact imply that Xn + Yn →p X + Y.

Theorem 7. If Xn →p X and Yn →p Y, then Xn + Yn →p X + Y.

Note that this result can be extended to products of two random variables, and if P(Y = 0) = 0, then we can also say that Xn/Yn →p X/Y. One of the issues that led us to Slutsky's theorem was that we could not say anything about the limiting distribution of Xn + Yn even though we knew the limiting distributions of Xn and Yn individually. To be able to say something about the distribution of Xn + Yn, we have to study the joint distribution of these two random variables. This leads us to the study of convergence of random vectors.

5.3 Convergence in distribution for random vectors

Let X1, X2, . . . , Xn, . . . be random vectors in R^d. Xn converges to X in distribution, denoted by Xn →d X, if and only if the joint CDFs satisfy F_{Xn}(a) → F_X(a) for any a ∈ R^d that is a continuity point of F_X. An important instance of convergence in distribution is the central limit theorem (CLT).

Theorem 8. Let X1, X2, . . . , Xn, . . . be independent and identically distributed random vectors with E(Xi) = µ and covariance matrix E[(Xi − µ)(Xi − µ)^T] = Σ. Then,

√n ( (1/n) Σ_{i=1}^n Xi − µ ) →d N(0, Σ).
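A short check of Theorem 8 (my own illustration; a two-dimensional example with independent Exponential(1) coordinates, so µ = (1, 1) and Σ = I, both arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative 2-dimensional example (assumptions): independent Exponential(1) coordinates,
# so mu = (1, 1) and Sigma is the identity matrix.
d, n, reps = 2, 500, 5_000
mu = np.ones(d)

x = rng.exponential(scale=1.0, size=(reps, n, d))
z = np.sqrt(n) * (x.mean(axis=1) - mu)     # one d-dimensional draw per repetition

# Theorem 8 predicts z is approximately N(0, Sigma); its sample covariance should be close to I.
print(np.cov(z, rowvar=False))
```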

Now, let us explain a quite general theorem that includes the types of results we intended to prove regarding the limiting distribution of Xn + Yn. This result is known as the continuous mapping theorem.

Theorem 9 (Continuous mapping theorem). Let Xi ∈ R^d and Xi →d X. If g : R^d → R^ℓ is a continuous function, then g(Xi) →d g(X).

A similar theorem holds for convergence in probability as well.

Theorem 10 (Continuous mapping theorem). Let Xi ∈ R^d and Xi →p X. If g : R^d → R^ℓ is a continuous function, then g(Xi) →p g(X).

As a simple corollary of the continuous mapping theorem you may prove that if the random vectors (Xi, Yi) →d (X, Y), then Xi + Yi →d X + Y. Why?
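As a final illustration (my own example, not from the notes), the continuous mapping theorem with g(z) = z² combined with the CLT implies that n(X̄n − µ)²/σ² is approximately chi-squared with 1 degree of freedom:

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative choices (assumptions): Uniform(0, 1) data, so mu = 0.5 and sigma^2 = 1/12.
n, reps = 1000, 10_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)

x = rng.uniform(size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # approximately N(0, 1) by the CLT
g_z = z ** 2                                     # g is continuous, so g(z) ~ chi-squared(1)

# A chi-squared(1) variable has mean 1 and variance 2.
print(g_z.mean(), g_z.var())   # expect values close to 1 and 2
```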
