Submitted to Bernoulli

A class of models for Bayesian predictive inference

PATRIZIA BERTI, EMANUELA DREASSI, LUCA PRATELLI and PIETRO RIGO

Dipartimento di Matematica Pura ed Applicata "G. Vitali", Università di Modena e Reggio-Emilia, via Campi 213/B, 41100 Modena, Italy. E-mail: [email protected]

Dipartimento di Statistica, Informatica, Applicazioni, Università di Firenze, viale Morgagni 59, 50134 Firenze, Italy. E-mail: [email protected]

Accademia Navale di Livorno, viale Italia 72, 57100 Livorno, Italy. E-mail: [email protected]

Dipartimento di Scienze Statistiche "P. Fortunati", Università di Bologna, via delle Belle Arti 41, 40126 Bologna, Italy. E-mail: [email protected]

In a Bayesian framework, to make predictions on a sequence X1, X2, . . . of random observations, the inferrer needs to assign the predictive distributions σn(·) = P(Xn+1 ∈ · | X1, . . . , Xn). In this paper, we propose to assign σn directly, without passing through the usual prior/posterior scheme. One main advantage is that no prior probability has to be assessed. The data sequence (Xn) is assumed to be conditionally identically distributed (c.i.d.) in the sense of [4]. To realize this programme, a class Σ of predictive distributions is introduced and investigated. Such a Σ is rich enough to model various real situations and (Xn) is actually c.i.d. if σn belongs to Σ. Furthermore, when a new observation Xn+1 becomes available, σn+1 can be obtained by a simple recursive update of σn. If µ is the a.s. weak limit of σn, conditions for µ to be a.s. discrete are provided as well.

MSC 2010 subject classifications: 62G99, 62F15, 62M20, 60G25, 60G57.

Keywords: Bayesian nonparametrics, Conditional identity in distribution, Exchangeability, Predictive distribution, Random probability measure, Sequential predictions, Strategy.

1. Introduction

The object of this paper is Bayesian predictive inference for a sequence of random observations. Let (Xn : n ≥ 1) be a sequence of random variables with values in a measurable space (S, B). Assuming that (X1, . . . , Xn) = x, for some n ≥ 1 and x ∈ S^n, the problem consists of predicting Xn+1 based on the observed data x. In a Bayesian framework, this means to assess the predictive distribution, say

σn(x)(B) = P(Xn+1 ∈ B | (X1, . . . , Xn) = x)   for all B ∈ B.

To address this problem, the Xn can be taken to be the coordinate random variables on S^∞. Accordingly, in the sequel, we let

Xn(s1, . . . , sn, . . .) = sn   for each n ≥ 1 and each (s1, . . . , sn, . . .) ∈ S^∞.

Also, to avoid needless technicalities, S is assumed to be a Borel subset of a Polish space and B the Borel σ-field on S. Let P denote the collection of all probability measures on B. Following Dubins and Savage [15], a strategy is a sequence σ = (σ0, σ1, . . .) such that

• σ0 ∈ P and σn = {σn(x) : x ∈ S^n} is a collection of elements of P;
• The map x ↦ σn(x)(B) is B^n-measurable for fixed n ≥ 1 and B ∈ B.

Here, σ0 should be regarded as the marginal distribution of X1 and σn(x) as the conditional distribution of Xn+1 given that (X1, . . . , Xn) = x. According to the Ionescu-Tulcea theorem, for any strategy σ, there is a unique probability measure P on (S^∞, B^∞) satisfying

P(X1 ∈ ·) = σ0   and   P(Xn+1 ∈ · | (X1, . . . , Xn) = x) = σn(x)

for all n ≥ 1 and P-almost all x ∈ S^n.

Such a P is denoted as Pσ in the sequel. To make predictions on the sequence (Xn), a Bayesian inferrer needs precisely a strategy σ. The Ionescu-Tulcea theorem establishes that, for any strategy σ, the predictions based on σ are consistent with a unique probability distribution for (Xn).

1.1. The standard and non-standard approach for exchangeable data

The data sequence (Xn) is usually assumed to be exchangeable. In that case, there are essentially two procedures for selecting a strategy σ. For definiteness, we call them the standard approach (SA) and the non-standard approach (NSA). The only reason for using these terms is that the first approach is much more popular than the second. Both approaches can be adopted to make Bayesian predictive inference and both lead to a full specification of the probability distribution of (Xn). According to SA, to obtain σ, the inferrer should:

• Select a prior π, namely, a probability measure on P;
• Calculate the posterior of π given that (X1, . . . , Xn) = x, say πn(x);
• Evaluate σ as

σ0(B) = ∫_P p(B) π(dp)   and   σn(x)(B) = ∫_P p(B) πn(x)(dp)   for all B ∈ B.
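As a minimal illustration of these three steps, consider the conjugate Beta-Bernoulli case, where prior, posterior and predictive are all available in closed form. The sketch below is ours; the prior parameters and the data are illustrative and not taken from the paper.

```python
# Minimal sketch of the standard approach (SA) in the conjugate
# Beta-Bernoulli case: select a prior, pass to the posterior, and integrate
# the model against the posterior to obtain the predictive distribution.
# Prior parameters and data below are illustrative choices of ours.

def predictive_success_probability(data, a=1.0, b=1.0):
    """Return sigma_n(x)({1}) = P(X_{n+1} = 1 | data) under a Beta(a, b) prior.

    The posterior given x = (x_1, ..., x_n) is Beta(a + s, b + n - s) with
    s = sum(x), and integrating the Bernoulli model against it gives the
    posterior mean (a + s) / (a + b + n) as the predictive probability.
    """
    n, s = len(data), sum(data)
    return (a + s) / (a + b + n)

x = [1, 0, 1, 1, 0, 1]                        # observed data (X_1, ..., X_n) = x
print(predictive_success_probability(x))      # sigma_n(x)({1}) = 5/8 here
```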

To assess a prior π is not an easy task. In addition, once π is selected, it is also quite difficult to evaluate the posterior πn(x). Frequently, it happens that πn(x) cannot be written in closed form but only approximated numerically. On the other hand, SA is not motivated by prediction alone. Another motivation, possibly the main one, is to make inference on other features of the data distribution, such as a mean, a quantile, a correlation, or more generally some random parameter (possibly, infinite dimensional). In all these cases, the posterior πn(x) is fundamental. In short, SA is a cornerstone of Bayesian inference but, when prediction is the main target, it is possibly quite involved.

Instead, NSA entails assigning σn directly, without passing through π and πn. Rather than choosing π and then evaluating πn and σn, the inferrer just selects his/her predictive distribution σn. This procedure makes sense because of the Ionescu-Tulcea theorem. See [3], [6], [9], [11], [12], [16], [19], [20], [24]; see also [17], [25], [26], [29] and references therein. NSA is in line with de Finetti, Dubins and Savage, among others. Pitman's work is fundamental as well; see e.g. [27] and [28]. In fact, NSA is usually adopted (or at least implicit) in species sampling models; see [24]. Similarly, NSA is used in [19] to obtain a fast online Bayesian prediction. Suppose that S = R and σn(x) admits a density, with respect to some fixed measure λ on B, for all n ≥ 0 and x ∈ S^n. In [19], the update of predictive distributions is given a nice characterization in terms of copulas. Such a characterization, in turn, allows for making Bayesian predictions through a useful recursive procedure. In a sense, the present paper fits into the framework of [19].

From our point of view, NSA has essentially two merits. Firstly, it requires the assignment of probabilities on observable facts only. The value of the next observation Xn+1 is actually observable, while π and πn (being probabilities on P) do not deal with observable facts. Secondly, as noted in [19, Section 6], NSA is much more efficient than SA when prediction is the main goal. In this case, why select the prior π explicitly? Rather than wondering about π, it seems reasonable to reflect on how Xn+1 is affected by (X1, . . . , Xn). Finally, NSA is even more appealing in a nonparametric framework, where selecting a prior with large support is usually difficult. We discuss an example to make the above remarks clearer.

Example 1 (SA versus NSA). If (Xn) is exchangeable, de Finetti's theorem yields

P(X1 ∈ B1, . . . , Xn ∈ Bn) = ∫_Θ Pθ(B1) · · · Pθ(Bn) π(dθ)

for some parameter space Θ, some prior π on Θ, and some statistical model M = {Pθ : θ ∈ Θ}, where Pθ ∈ P for each θ.

In the parametric case, Θ is a Borel subset of R^k and M is dominated and smooth. In the nonparametric case, Θ is infinite-dimensional, typically Θ = P. In both cases, SA entails selecting π, evaluating the posterior πn and calculating σ as

σn(x)(B) = ∫_Θ Pθ(B) πn(x)(dθ).

In turn, NSA entails selecting σ directly, without passing through π and πn. In our opinion, SA may be unsuitable for prediction even in the parametric framework. Not only is it hard to choose π, but evaluating πn may be difficult as well. On the contrary, NSA usually takes the available information on the data into account more effectively. In fact, in various practical situations, arguing in terms of strategies is simpler than arguing in terms of priors. An obvious example is Polya urns, where the strategy σ is naturally determined by the sampling scheme, while the prior π is not. The merits of NSA increase further in the nonparametric framework. In that case, if prediction is the main goal, assessing a prior π and evaluating the posterior πn is really too expensive.

One more remark is in order. Because of exchangeability, the probability distribution P of (Xn) can be written as above for some M and π. However, by assigning a strategy σ, the inferrer identifies P = Pσ, not the pair (M, π). In a sense, when applying NSA, the "model uncertainty" about θ is integrated out by the choice of σ. This appears reasonable after all since, when making predictions, the relevant object is σ and not (M, π). An intriguing problem, pioneered by Diaconis, Ylvisaker and Freedman, is to give conditions on σ implying that the statistical model M underlying Pσ has a given form, for instance an exponential family. Such a problem, however, is not investigated in this paper. See [13], [14], [16], [30] and references therein.
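To make the Polya urn remark concrete, here is a small sketch of ours of NSA in that setting: the predictive σn(x) = (c σ0 + Σ_{i≤n} δxi)/(c + n), for a base measure σ0 and a constant c > 0 (the predictive of a Dirichlet sequence), is assigned directly and the data are generated from it sequentially, with no prior or posterior ever computed. The base measure and the value of c are illustrative choices.

```python
import random

# NSA sketch for a Polya urn / Dirichlet sequence: the predictive
#   sigma_n(x) = (c * sigma_0 + sum_{i<=n} delta_{x_i}) / (c + n)
# is assigned directly, and the data are generated from it sequentially.
# Sampling step: with probability c / (c + n) draw a fresh value from the
# base measure sigma_0, otherwise repeat a uniformly chosen past observation.
# The base measure and the value of c below are illustrative choices.

def sample_polya_sequence(n, c=2.0, base_sampler=random.random, seed=0):
    random.seed(seed)
    xs = []
    for i in range(n):
        if random.random() < c / (c + i):
            xs.append(base_sampler())        # new value drawn from sigma_0
        else:
            xs.append(random.choice(xs))     # copy of a past observation
    return xs

path = sample_polya_sequence(25)
print(len(set(path)), "distinct values among", len(path), "observations")
```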

1.2. Conditionally identically distributed data

If (Xn) is assumed to be exchangeable, however, NSA has a gap. Given an arbitrary strategy σ, the Ionescu-Tulcea theorem does not grant exchangeability of (Xn) under Pσ. Therefore, for NSA to apply, one should first characterize those strategies σ which make (Xn) exchangeable under Pσ. A nice characterization is [16, Theorem 3.1]. However, the conditions on σ for making (Xn) exchangeable are quite hard to check in real problems. This is possibly one of the reasons why NSA has not yet been developed. Another reason is the lack of constructive procedures for determining σ. It is precisely this lack which makes SA necessary for prediction, even if analytically more involved.

An obvious way to bypass the gap mentioned in the above paragraph is to weaken the exchangeability assumption. One option is to assume (Xn) to be conditionally identically distributed (c.i.d.), namely

P(Xk ∈ · | Fn) = P(Xn+1 ∈ · | Fn)   a.s. for all k > n ≥ 0,

where Fn = σ(X1, . . . , Xn) and F0 is the trivial σ-field. Roughly speaking, the above condition means that, at each time n ≥ 0, the future observations (Xk : k > n) are identically distributed given the past Fn. Such a condition is actually weaker than exchangeability. Indeed, (Xn) is exchangeable if and only if it is stationary and c.i.d. We refer to Subsection 2.1 for more information on c.i.d. sequences. Here, we just mention three reasons for taking c.i.d. data into account.

• It is not hard to characterize the strategies σ which make (Xn) c.i.d. under Pσ; see Theorem 3. Therefore, unlike the exchangeable case, NSA can be easily implemented.
• The asymptotic behavior of c.i.d. sequences is analogous to that of exchangeable ones.
• A number of meaningful strategies cannot be used if (Xn) is assumed to be exchangeable, but are available if (Xn) is only required to be c.i.d. See the examples in Sections 4-6.

To support the latter claim, we also note that conditional identity in distribution is a more appropriate assumption than exchangeability in some real problems. Examples occur in various fields, including clinical trials, generalized Polya urns, species sampling models and disease surveillance; see [1], [2], [4] and [10].

1.3. Further notation and conditions (a)-(b)

A kernel (or a random probability measure) on (S, B) is a collection α = {α(x) : x ∈ S} such that α(x) ∈ P for each x ∈ S and the map x ↦ α(x)(B) is measurable for fixed B ∈ B. Here, α(x)(B) denotes the value taken at B by the probability measure α(x). Let σ0 ∈ P and α a kernel on (S, B). In the sequel, σ0 and α are such that:

(a) σ0 is a stationary distribution for α, namely,

σ0(B) = ∫ α(x)(B) σ0(dx)   for all B ∈ B;

(b) There is a set A ∈ B such that σ0(A) = 1 and

α(x)(B) = ∫ α(z)(B) α(x)(dz)   for all x ∈ A and B ∈ B.

Conditions (a)-(b) are not so unusual. For instance, they are satisfied whenever α is a regular conditional distribution for σ0 given any sub-σ-field of B; see Lemma 6. In particular, conditions (a)-(b) trivially hold if

A = S   and   α(x) = δx for all x ∈ S,

where δx denotes the point mass at x.
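As a quick numerical illustration of the remark above (a sketch of ours on a four-point space), one can take α(x) to be σ0 conditioned on the cell of a partition containing x, i.e. a regular conditional distribution for σ0 given the σ-field generated by the partition, and check conditions (a) and (b) directly.

```python
import numpy as np

# Finite illustration (our own construction) of conditions (a)-(b) for a
# regular conditional distribution.  S = {0, 1, 2, 3}, sigma_0 is a
# probability on S, and alpha(x) is sigma_0 conditioned on the cell of the
# partition {0, 1} | {2, 3} that contains x.

sigma0 = np.array([0.1, 0.3, 0.2, 0.4])
cells = [[0, 1], [2, 3]]

alpha = np.zeros((4, 4))                       # alpha[x, z] = alpha(x)({z})
for cell in cells:
    p = sigma0[cell] / sigma0[cell].sum()      # sigma_0 restricted to the cell
    for x in cell:
        alpha[x, cell] = p

# (a) stationarity: sum_x sigma_0({x}) alpha(x)({z}) = sigma_0({z}) for all z
print(np.allclose(sigma0 @ alpha, sigma0))     # True
# (b) sum_w alpha(x)({w}) alpha(w)({z}) = alpha(x)({z}) for all x, z (here A = S)
print(np.allclose(alpha @ alpha, alpha))       # True
```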

Finally, if x = (x1, . . . , xn) ∈ S^n and y ∈ S, we write (x, y) to denote (x, y) = (x1, . . . , xn, y). In addition, for any strategy σ, we let

S^0 = {∅},   σ0(∅) = σ0,   σ1(∅, y) = σ1(y).

1.4. Content of this paper

We aim to develop NSA for c.i.d. data. To this end, we introduce and investigate a class Σ of strategies. Such a Σ is rich enough to model various real situations and (Xn) is c.i.d. under Pσ for each σ ∈ Σ. Furthermore, when a new observation Xn+1 becomes available, σn+1 can be obtained from a simple recursive update of σn.

Each σ ∈ Σ can be described as follows. Fix σ0 ∈ P and a kernel α on (S, B) satisfying conditions (a)-(b). Also, for every n ≥ 0, fix a measurable function fn : S^{n+2} → [0, 1] such that

fn(x, y, z) = fn(x, z, y)   for all x ∈ S^n and (y, z) ∈ S^2.

Given σ0, α and (fn : n ≥ 0), a strategy σ can be obtained via the recursive equation

σn+1(x, y)(B) = ∫ α(z)(B) fn(x, y, z) σn(x)(dz) + α(y)(B) {1 − ∫ fn(x, y, z) σn(x)(dz)}

for all n ≥ 0, B ∈ B, x ∈ S^n and y ∈ S. We define Σ as the collection of all the strategies σ obtained as above.

The simplest example corresponds to fn(x, y, z) = qn(x), where qn : S^n → [0, 1] is any measurable map (with q0 constant). In that case, the recursive equation reduces to

σn+1(x, y) = qn(x) σn(x) + (1 − qn(x)) α(y)     (1)

for all n ≥ 0, x ∈ S^n and y ∈ S. (Here, for the sake of simplicity, we are assuming A = S, where A is the set involved in condition (b).) In this specific case, the updating rule is quite transparent: σn+1(x, y) is just a convex combination of the previous predictive distribution σn(x) and the kernel α evaluated in the last observation y.

In addition, the weight qn(x) does not depend on y. Note also that σn+1(x, y) can be written explicitly (and not only in recursive form) as

σn+1(x, y) = σ0 ∏_{i=0}^{n} qi + α(y) (1 − qn) + ∑_{i=1}^{n} α(xi) (1 − qi−1) ∏_{j=i}^{n} qj,

where x = (x1, . . . , xn) ∈ S^n, y ∈ S and qi is a shorthand notation to denote qi = qi(x1, . . . , xi).
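To illustrate the recursion (1), here is a small sketch of ours with α(x) = δx and A = S: the predictive σn is stored as a weight on σ0 plus a weight on the point mass at each distinct past observation, and each new observation y rescales the current weights by qn and adds mass 1 − qn at y. The specific choice qn = (n + θ)/(n + 1 + θ) is only illustrative; with this choice the recursion reproduces the familiar Dirichlet-sequence predictive (θ σ0 + Σ_{i≤n} δxi)/(θ + n).

```python
# Sketch of the recursive update (1) with alpha(x) = delta_x and A = S.
# The predictive sigma_n is represented by a weight w0 on sigma_0 and a weight
# on the point mass at each distinct past observation.  The choice of q_n
# below is illustrative: q_n = (n + theta) / (n + 1 + theta) makes sigma_n the
# Dirichlet-sequence predictive (theta*sigma_0 + sum_i delta_{x_i}) / (theta + n).

def update(predictive, y, q):
    """One step of (1): sigma_{n+1}(x, y) = q * sigma_n(x) + (1 - q) * delta_y."""
    w0, atoms = predictive
    atoms = {v: q * w for v, w in atoms.items()}   # rescale existing atoms
    atoms[y] = atoms.get(y, 0.0) + (1.0 - q)       # add mass at the new observation
    return q * w0, atoms

theta = 2.0
predictive = (1.0, {})                             # at n = 0, sigma_0 carries all the mass
for n, y in enumerate([3.1, 0.7, 3.1, 2.2, 0.7, 0.7]):
    q = (n + theta) / (n + 1 + theta)              # q_n, constant in x here
    predictive = update(predictive, y, q)

w0, atoms = predictive
print(round(w0, 4), {v: round(w, 4) for v, w in atoms.items()})
print(round(w0 + sum(atoms.values()), 10))         # total mass remains 1
```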

In general, specifying fn and α suitably, various meaningful strategies can be shown to be members of Σ. Some of these strategies are well known and some are new (in the sense that, to our knowledge, they have not been proposed to date). Examples of the former are the predictive distributions of Dirichlet sequences, species sampling sequences and generalized Polya urns. Examples of the latter are the strategies of Sections 5-6. Another nice feature of Σ is that it also includes diffuse strategies, and this fact may be useful to model real situations. We recall that a probability measure is diffuse if it vanishes on singletons, and a strategy σ is diffuse if σn(x) is diffuse for all n ≥ 0 and x ∈ S^n.

In addition to introducing Σ, our main contributions are Theorems 4-5 and Theorems 18-20. The former state that (Xn) is c.i.d. under Pσ for each σ ∈ Σ, while the latter deal with the asymptotics of σn. A few words should be spent on Theorem 18. Let X1*, X2*, . . . denote the (finite or infinite) sequence of distinct values corresponding to the observations X1, X2, . . . If (Xn) is c.i.d. under Pσ, where σ is any strategy (possibly not belonging to Σ), there is a random probability measure µ on (S, B) such that

σn(B) → µ(B)   a.s. for every fixed B ∈ B,

where "a.s." stands for "Pσ-a.s."; see Subsection 2.1. Theorem 18 states that

µ = ∑_k Wk δ_{Xk*}   a.s., for some random weights Wk ≥ 0 with ∑_k Wk = 1,

if and only if

lim_n Pσ(Xn ≠ Xi for each i < n) = 0.

Furthermore, Wk admits the representation

Wk = lim_n (1/n) ∑_{i=1}^{n} 1{Xi = Xk*}   a.s.

By applying Theorem 18 to σ ∈ Σ, it is not hard to give conditions on fn and α implying that µ is a.s. discrete. Conditions for X1*, X2*, . . . to be i.i.d. and independent of the weights W1, W2, . . . are given as well.
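The frequency representation of Wk suggests a direct empirical estimate along a single path. Below is a small sketch of ours using a Polya-type sequence (as in the earlier NSA example, with an illustrative base measure and parameter), tabulating the distinct values X1*, X2*, . . . in order of appearance together with their relative frequencies after n steps.

```python
import random
from collections import Counter

# Empirical estimate of the weights W_k of Theorem 18 as relative frequencies
# of the distinct values X_1*, X_2*, ... (in order of appearance) along one
# simulated path of a Polya-type sequence.  Parameters are illustrative.

def polya_path(n, c=2.0, seed=1):
    random.seed(seed)
    xs = []
    for i in range(n):
        if random.random() < c / (c + i):
            xs.append(random.random())        # fresh draw from the base measure
        else:
            xs.append(random.choice(xs))      # repeat a past observation
    return xs

n = 5000
xs = polya_path(n)
counts = Counter(xs)
distinct = list(dict.fromkeys(xs))            # X_1*, X_2*, ... in order of appearance
weights = [counts[v] / n for v in distinct]   # empirical W_k after n observations
print(len(distinct), "distinct values; first five weights:",
      [round(w, 3) for w in weights[:5]])
```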

It is worth noting that Theorem 18 holds true for any strategy σ which makes (Xn) c.i.d. Hence, Theorem 18 extends a known fact concerning exchangeability to all c.i.d. sequences; see e.g. [23]. In addition to the results quoted above, other main contributions of this paper are the examples included in Sections 4-6. In our opinion, these examples should support the fact that Σ is rich enough to cover a wide range of problems.

2. Preliminaries

2.1. Conditional identity in distribution

C.i.d. sequences have been introduced in [4] and [21] and then investigated in various papers; see e.g. [1], [2], [6], [7], [8], [9], [10], [18]. Here, we just recall a few basic facts.

Let (Gn : n ≥ 0) be a filtration and (Yn : n ≥ 1) a sequence of S-valued random variables. Then, (Yn) is c.i.d. with respect to (Gn) if it is adapted to (Gn) and

P(Yk ∈ · | Gn) = P(Yn+1 ∈ · | Gn)   a.s. for all k > n ≥ 0.

When (Gn) is the canonical filtration of (Yn), the filtration is not mentioned at all and (Yn) is just called c.i.d. From a result in [21], (Yn) is exchangeable if and only if it is stationary and c.i.d.

Let (Yn) be c.i.d., Gn = σ(Y1, . . . , Yn), and

µn = (1/n) ∑_{i=1}^{n} δ_{Yi}

the empirical measure. In a sense, the asymptotic behavior of (Yn) is similar to that of an exchangeable sequence. This claim can be supported by two facts. First, there is a random probability measure µ on (S, B) satisfying

µn(B) → µ(B)   a.s. for every fixed B ∈ B.

As a consequence, for fixed n ≥ 0 and B ∈ B, one obtains

E(µ(B) | Gn) = lim_m E(µm(B) | Gn) = lim_m (1/m) ∑_{i=n+1}^{m} P(Yi ∈ B | Gn) = P(Yn+1 ∈ B | Gn)   a.s.

Thus, as in the exchangeable case, the predictive distribution P(Yn+1 ∈ · | Gn) can be written as E(µ(·) | Gn), where µ is the a.s. weak limit of the empirical measures µn. In particular, for each B ∈ B, the martingale convergence theorem implies

P(Yn+1 ∈ B | Gn) = E(µ(B) | Gn) → µ(B)   a.s.     (2)

Second, (Yn) is asymptotically exchangeable, in the sense that (Yn, Yn+1, . . .) → (Z1, Z2, . . .) in distribution, as n → ∞, where (Zn) is an exchangeable sequence. Moreover, (Zn) is directed by µ, namely

P(Z1 ∈ B1, . . . , Zk ∈ Bk) = E{µ(B1) · · · µ(Bk)}

for all k ≥ 1 and B1, . . . , Bk ∈ B.

The role played by µ is not as crucial as in the exchangeable case, since the probability distribution of (Yn) is not completely determined by µ; see Example 17. Nevertheless, µ is a meaningful random parameter for (Yn). In fact, µ(B) is the long run frequency of the events {Yn ∈ B}. Similarly, because of (2), µ(B) can be regarded as the asymptotically optimal predictor of the event {the next observation belongs to B}. And finally, µ is the directing measure of the exchangeable limit sequence (Zn).
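For a quick numerical illustration of these two facts, the following sketch of ours simulates one path of a Polya (hence exchangeable, in particular c.i.d.) sequence with a Bernoulli(1/2) base measure and compares the predictive probability P(Yn+1 = 1 | Gn) with the empirical frequency µn({1}): along the path, both approach the same random limit µ({1}). All parameter values are illustrative.

```python
import random

# One path of a Polya sequence with Bernoulli(1/2) base measure and
# concentration theta: the predictive P(Y_{n+1} = 1 | G_n) and the empirical
# frequency mu_n({1}) settle down to the same random limit mu({1}).
# theta, the base measure and the path length are illustrative choices.

random.seed(3)
theta, p0, N = 2.0, 0.5, 20000
ones = 0
for n in range(N):
    predictive = (theta * p0 + ones) / (theta + n)   # P(Y_{n+1} = 1 | G_n)
    y = 1 if random.random() < predictive else 0     # draw Y_{n+1} from the predictive
    ones += y

print(round(predictive, 4), round(ones / N, 4))      # last predictive vs mu_N({1})
```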

2.2. Stationarity, reversibility and characterizations

We first recall some definitions. Let τ ∈ P and α = {α(x) : x ∈ S} a kernel on (S, B). Then:

• τ is a stationary distribution for α if ∫ α(x)(B) τ(dx) = τ(B) for all B ∈ B;
• α is reversible with respect to τ if ∫_A α(x)(B) τ(dx) = ∫_B α(x)(A) τ(dx) for all A, B ∈ B;
• α is a regular conditional distribution for τ given G, where G ⊂ B is a sub-σ-field, if x ↦ α(x)(B) i...

