
Introduction to the Kalman Filter and its Derivation

Brent Perreault∗
Concordia College, Moorhead, Minnesota

April 19, 2012

Senior Seminar
Dr. Oksana Bihun

∗[email protected]

Abstract

This paper reviews an important result in estimation theory, now known as the Kalman filter, named after Rudolf E. Kalman. The Kalman filter solves the least-squares estimation problem recursively, and in a computationally efficient manner. The problem is carefully stated in the second section and is later solved using matrix derivatives. The resulting solution, the linear Kalman filter, is collected into an algorithm that finds a number of applications where one wants to estimate a collection of related, measurable values as data is collected.

1 Introduction

This paper intends to review the theory of the linear Kalman filter algorithm in a way that is accessible to upper-level undergraduates who have not been introduced to estimation theory or prediction and filtering. It can be used alongside the practical review by Welch and Bishop [1], offering the interested reader a straightforward derivation of the filter in Section 3 based on the matrix-theoretic principles developed there. The linear Kalman filter algorithm finds the linear least-squares best estimate by going through the data one measurement at a time. That is, each time a new measurement is added, the filter finds the new best estimate for the full set of data by combining the new measurement, the old estimate, and some measure of confidence in the old estimate. In this way it solves the problem of least-squares estimation in a recursive manner, yielding an estimate of the state each time a new measurement is included. It is the recursive nature of the filter, along with its computational efficiency, that makes it ideal for applications such as auto-pilot navigation, among many others, where one wants an estimate of the trajectory that is updated every time new position measurements are taken. In this text we will only consider the discrete linear Kalman filter. However, there are a number of extensions of this approach that allow for non-linear estimates and continuous-time estimation. Nonetheless, we find this approach to be most appropriate for an introduction, and it can very easily lead the reader on to more complicated filters.

This paper is organized as follows. In Section 2 we introduce the type of process that we will consider and phrase the problem in terms of the appropriate matrix mechanics. In Section 3 we solve the problem, obtaining a set of recursive relations using a matrix-gradient approach, which we take care to develop in the first subsection. The final subsection collects the results into an organized algorithm and discusses implementation.

2 The Problem

2.1 Set up

The processes that we consider here have a finite number of measurable quantities that we collect in a state vector x. The state vector is assumed to follow simple stochastic Markov dynamics. The word "Markov" indicates that each subsequent state depends only on the one before it. We say that the process is stochastic because the subsequent values are not deterministic: they do not have to happen the same way every time. Instead, the value of the state vector at time step k+1, denoted x_{k+1}, depends on the value of x_k by the linear relation

x_{k+1} = A x_k + w_k,    (1)

where A is an n × n matrix that is characteristic of the process, and w_k ∈ ℝ^n is a random vector that is also characteristic of the process. The vector w_k is a random vector, meaning each of its components is a random variable. Each component is distributed normally about zero with some variance and is independent of the others. In this way, at each time step k the process will have determined a sequence of vectors x_0, x_1, ..., x_k. Notice that the process evolves only approximately linearly, and the difference w_k is known as the process noise. All that we can know about the value of w_k before time k is the covariance between each pair of its components, giving us some idea of how the random variables are related and of their variances. This information is conveniently collected in the covariance matrix. Recall that the covariance of any two random variables X and Y is given by

Cov(X, Y) = E[(X - X̄)(Y - Ȳ)].    (2)

Here the expected value is E[X] = ∫ x P(x) dx, where the integral is over all possible values of X and is interpreted as a sum if the domain of X is discrete. Also, X̄ = E[X] is shorthand for the mean value of the random variable. Note that the covariance of X with itself is simply the variance of X. In general, for a random vector v ∈ ℝ^m whose components are v_j, j = 1, 2, ..., m, the covariance matrix C is defined by the relation

c_{ij} = Cov(v_i, v_j).    (3)
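As a concrete illustration of definitions (2) and (3), the following sketch estimates a covariance matrix from samples of a random vector and compares it with NumPy's built-in estimator. The sample distribution and sample size are invented for illustration; the paper itself works purely analytically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw samples of a random vector v in R^3 with a known (made-up) covariance.
true_C = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.0, 0.5],
                   [0.0, 0.5, 1.5]])
samples = rng.multivariate_normal(mean=np.zeros(3), cov=true_C, size=100_000)

# Empirical version of c_ij = Cov(v_i, v_j) = E[(v_i - v̄_i)(v_j - v̄_j)].
v_bar = samples.mean(axis=0)
centered = samples - v_bar
C_hat = centered.T @ centered / (len(samples) - 1)

print(np.allclose(C_hat, np.cov(samples.T)))  # same estimator as NumPy's
print(np.round(C_hat, 2))  # close to true_C; symmetric, variances on the diagonal
```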

Note that we are risking some confusion in our notation here, since for v we have allowed the subscript to index the respective component, while for all of our other vectors the subscript represents the current time step. Also note that the covariance matrix is symmetric about its diagonal, whose entries are the variances of each corresponding variable.

Now, to model what we find in applications, we introduce the vector z_k ∈ ℝ^n, which represents our measurement of the state vector x_k. However, to make for the most useful model we also allow for some Gaussian measurement noise v_k (not related to v), and for the measurement to be related linearly to the actual state by some matrix H_k, which may change with time. This gives the relation

z_k = H_k x_k + v_k.    (4)

2.2 Summary

To summarize, we have the state vector x_k, which propagates approximately linearly via the relation x_{k+1} = A x_k + w_k, and a measurement of the state which is approximately the same as the state via z_k = H_k x_k + v_k. Each of these noise vectors is specified by its respective covariance matrix, which we will call Q and R, so that we write

w_k ∼ N(0, Q),    v_k ∼ N(0, R).    (5)

Here the symbol N indicates the normal (Gaussian) distribution and takes two arguments: a vector mean and a covariance matrix.
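For readers who want to experiment, here is a hedged simulation sketch of the model just summarized. The matrices A, H, Q, and R, and the horizon are made-up values, not taken from the paper; the only claim is that the loop implements equations (1), (4), and (5).

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])           # process matrix (assumed constant-velocity model)
H = np.array([[1.0, 0.0]])           # we only measure the first state component
Q = np.diag([1e-3, 1e-2])            # process-noise covariance, w_k ~ N(0, Q)
R = np.array([[0.25]])               # measurement-noise covariance, v_k ~ N(0, R)

x = np.array([0.0, 1.0])             # initial state
xs, zs = [], []
for k in range(100):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # x_{k+1} = A x_k + w_k
    z = H @ x + rng.multivariate_normal(np.zeros(1), R)   # z_k = H_k x_k + v_k
    xs.append(x)
    zs.append(z)
xs, zs = np.array(xs), np.array(zs)  # true states and noisy measurements
```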

2.3 Problem statement

The problem that we intend to solve is to find a recursive way to calculate the best linear estimate, the one that minimizes the sum of the squared errors. We will denote our best estimate of the state x_k by x̂_k. This is the best estimate based on all of the measurements up to and including z_k. Let us denote the set of information determined at time step k by¹

I_k = {z_k, z_{k-1}, ..., z_0, x_k, x_{k-1}, ..., x_0}.

It is also useful to consider the estimate of x_k using only the measurements available at time k - 1, that is, when the determined information is I_{k-1}, which we will denote by x̂_k^-. We may also write this as

x̂_k^- = E[x_k | I_{k-1}].

This is known as the a priori estimate, as opposed to the a posteriori estimate, formed after the new measurement is available.

¹ To avoid confusion with the identity matrix I, we denote the information set I_k with lowercase.


Next we define the error that we want to minimize: the theoretical difference between our estimated value and the actual value,

e_k = x_k - x̂_k,    e_k^- = x_k - x̂_k^-,    (6)

where these are the a posteriori and a priori errors, respectively. We can reduce the problem to minimizing the trace of the following covariance matrices, the a priori and a posteriori error covariance matrices, given respectively by

P_k^- = E[e_k^- (e_k^-)^T],    P_k = E[e_k e_k^T].    (7)

Notice that the trace of one of these matrices is the sum of the variances of the elements of the respective error vector. Finally, we write down our solution form. We assume that the a posteriori estimate is determined from the current a priori estimate plus some linear combination of the most recent residual,

x̂_k = x̂_k^- + K_k (z_k - H_k x̂_k^-).    (8)

We notice that this is an extension of the usual linear form, with the matrix K_k representing the optimal weight that we want to put on our newest measurement. Here we notice that the difference (z_k - H_k x̂_k^-) is the matrix version of the familiar residual: the difference between the newly measured value and our current estimate based on the other measured data. The weight then tells us how much that difference should change our estimate, which depends on how confident we are in the a priori estimate. We can thus reduce our problem to finding the matrix K_k that minimizes the trace of P_k.
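Before the optimal K_k is derived in Section 3, a toy scalar example may help build intuition for how the weight in (8) blends the a priori estimate with the new measurement. All numbers here are invented, and H_k is taken to be 1 for simplicity.

```python
import numpy as np

# Scalar illustration of the solution form (8): the a posteriori estimate is the
# a priori estimate corrected by a weighted residual.  The numbers are made up.
x_prior = 4.0          # a priori estimate  x̂_k^-
z = 5.0                # new measurement    z_k   (here H_k = 1)
residual = z - x_prior

for K in (0.0, 0.2, 0.5, 1.0):
    x_post = x_prior + K * residual       # x̂_k = x̂_k^- + K_k (z_k - H_k x̂_k^-)
    print(f"K = {K:.1f}  ->  estimate {x_post:.2f}")
# K = 0 ignores the measurement entirely; K = 1 discards the a priori estimate.
# The Kalman filter chooses K_k to minimize the a posteriori error variance.
```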

3 The Solution

3.1 The matrix gradient

To minimize the trace of the error covariance matrix P_k with respect to each independent element of K_k according to the solution form (8), we must simply set to zero the derivative with respect to each matrix element. To simplify our calculation we will introduce one form of matrix derivative, the matrix gradient, that will allow us to take full advantage of the matrix algebra we have afforded ourselves.² Note first that for this subsection we will suspend our use of subscripts to represent the current time step and revert to the notation of linear algebra, where the two indices on a_ij indicate the ith row and jth column of the matrix A.

² This is not how R. E. Kalman originally solved the problem. For his solution see his paper [2].


Recall the gradient from vector calculus (we will assume three spatial dimensions):³

∇f(x_1, x_2, x_3) = x̂_1 ∂f/∂x_1 + x̂_2 ∂f/∂x_2 + x̂_3 ∂f/∂x_3 = (∂f/∂x_1, ∂f/∂x_2, ∂f/∂x_3)^T.

³ Here the hats (x̂) are used to indicate unit vectors rather than estimates.

It is a differential operator (like the derivative) that maps a function of three variables to a vector whose elements are the respective partial derivatives of the function. We follow this idea to define the matrix gradient of a scalar-valued function f of m·n variables, all collected in the m × n matrix A, to be the matrix whose elements are the partial derivatives of the function with respect to the corresponding matrix element [3]:

∇_A f(a_11, a_12, ...) :=
    [ ∂f/∂a_11   ∂f/∂a_12   ···   ∂f/∂a_1n ]
    [ ∂f/∂a_21   ∂f/∂a_22   ···   ∂f/∂a_2n ]
    [    ⋮           ⋮                ⋮     ]
    [ ∂f/∂a_m1   ∂f/∂a_m2   ···   ∂f/∂a_mn ]

We may also write f(A) to indicate that the function f depends on every one of the elements of A. With this, our definition can be written in terms of the typical element of A as

[∇_A f(A)]_ij = ∂f(A)/∂a_ij.    (9)

We will now use this element notation to develop some matrix-gradient rules for the trace of a matrix. All of our discussion will be for square matrices.⁴ Consider, first, two simple examples before we take on the quadratic form that we will use in our application. The matrix B will be considered constant throughout; that is, each of its elements is independent of each element of the matrix A, over whose elements we take the matrix gradient.

⁴ Recall that the trace of a matrix is the sum of its diagonal elements, tr(A) = Σ_i a_ii. Also, we can write the product of two matrices in typical element notation as [AB]_ij = Σ_r a_ir b_rj, and the transpose A^T of a matrix A has [A^T]_ij = a_ji. The subscripts in the sums are assumed to vary over all possible values (1 to n), and a_ij is shorthand for [A]_ij.

Theorem 1. ∇_A tr(A) = I


Proof.

∂ tr(A)/∂a_ij = ∂/∂a_ij Σ_k a_kk = { 1 if i = j; 0 if i ≠ j } = [I]_ij.

Theorem 2. ∇_A tr(AB) = B^T

Proof. Recall that

tr(AB) = Σ_{k,r} a_kr b_rk.

Then we have

∂ tr(AB)/∂a_ij = ∂/∂a_ij Σ_{k,r} a_kr b_rk = ∂/∂a_ij (a_ij b_ji) = b_ji = [B^T]_ij.

Theorem 3. ∇_A tr(BA^T) = B

Proof. Left to the reader (do it!).

Now we consider the form that we will encounter, the quadratic form, a matrix product with two appearances of the matrix A (or its transpose). The following theorem will suffice.

Theorem 4. ∇_A tr(A^T BA) = BA + B^T A

Proof. First we write the trace of the three-matrix product in element notation:

tr(A^T BA) = Σ_{k,l,r} (a^T)_{kl} b_{lr} a_{rk} = Σ_{k,l,r} a_{lk} b_{lr} a_{rk},

where we have extended the transpose notation to the typical element to indicate that its indices are to be switched. Then we have

∂ tr(A^T BA)/∂a_ij = ∂/∂a_ij Σ_{k,l,r} a_{lk} b_{lr} a_{rk}.


We continue by considering the two cases in which terms survive the sum. The first is when l = i and k = j, so that the derivative matches the first factor a_{lk}. The second is when the derivative matches the second factor, that is, r = i and k = j. These two cases lead to two contributing sums in our solution:

[∇_A tr(A^T BA)]_ij = Σ_r b_{ir} a_{rj} + Σ_l a_{lj} b_{li} = Σ_r b_{ir} a_{rj} + Σ_l (b^T)_{il} a_{lj}.

Finally, we recognize the last two sums as the typical elements of the two matrix products, BA and B^T A.

We make use of the following corollary in the next subsection.

Corollary 5. If B is a symmetric (square) matrix, then ∇_A tr(A^T BA) = 2BA.
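These identities can also be checked numerically. The sketch below is a finite-difference verification of Theorems 1-4 and Corollary 5; the helper num_grad and the random test matrices are illustrative, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

def num_grad(f, A, eps=1e-6):
    """Central-difference approximation of the matrix gradient [∇_A f]_ij = ∂f/∂a_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

print(np.allclose(num_grad(lambda A: np.trace(A), A), np.eye(n)))                   # Thm 1
print(np.allclose(num_grad(lambda A: np.trace(A @ B), A), B.T))                     # Thm 2
print(np.allclose(num_grad(lambda A: np.trace(B @ A.T), A), B))                     # Thm 3
print(np.allclose(num_grad(lambda A: np.trace(A.T @ B @ A), A), B @ A + B.T @ A))   # Thm 4

S = B + B.T                                                                         # symmetric
print(np.allclose(num_grad(lambda A: np.trace(A.T @ S @ A), A), 2 * S @ A))         # Cor 5
```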

3.2 Deriving the equations

Here we derive the list of equations that together make up the Kalman filter. In doing so we follow the approach taken by Chris Semis [4], using the matrix gradient to obtain the desired optimization. The estimates based on I_k are a posteriori estimates, and those based on I_{k-1} are a priori estimates. We assume that

x̂_k = E[x_k | I_k],    x̂_k^- = E[x_k | I_{k-1}].

Also note that, as in the definition of P_k, the expected value of the outer product of a zero-mean random vector with itself yields the respective covariance matrix:

E[w_k w_k^T] = Q,    E[v_k v_k^T] = R,    E[e_k e_k^T] = P_k,    E[e_k^- (e_k^-)^T] = P_k^-,

where the expected value of a matrix is simply the matrix whose components are the expected values of the respective matrix elements ([E[A]]_ij = E[a_ij]). We collect here the last two equations that we will use:

e_k = x_k - x̂_k,    e_k^- = x_k - x̂_k^-.

The problem, then, that we intend to solve is put simply as follows:

minimize tr(P_k) subject to x̂_k = x̂_k^- + K_k (z_k - H_k x̂_k^-).

The desired solution turns out to be

K_k = P_k^- H_k^T (H_k P_k^- H_k^T + R)^{-1}.    (10)

Proof. Let us denote the outer product of a vector with itself, yy^T, by y² (we will not need the inner product here). We simply start by replacing terms with their definitions to write P_k in terms of the other matrices [4]:

P_k(K_k) = E[e_k e_k^T] = E[(e_k)²]
         = E[(x_k - x̂_k^- - K_k(z_k - H_k x̂_k^-))²]
         = E[(x_k - (I - K_k H_k) x̂_k^- - K_k z_k)²]
         = E[(x_k - (I - K_k H_k) x̂_k^- - K_k H_k x_k - K_k v_k)²]
         = E[((I - K_k H_k)(x_k - x̂_k^-) - K_k v_k)²]
         = E[((I - K_k H_k) e_k^- - K_k v_k)²]
         = (I - K_k H_k) E[(e_k^-)²] (I - K_k H_k)^T + K_k E[(v_k)²] K_k^T
         = (I - K_k H_k) P_k^- (I - K_k H_k)^T + K_k R K_k^T.

In the second-to-last line we used the orthogonality of e_k^- and v_k. In the last lines we also used that the matrices K_k and H_k take on fixed values at time step k and thus are constant with respect to the expected value. Expanding,

P_k = P_k^- - K_k H_k P_k^- - P_k^- H_k^T K_k^T + K_k (H_k P_k^- H_k^T + R) K_k^T.    (11)

Now we can take the gradient of the trace of P_k with respect to K_k:

∇_{K_k} tr(P_k) = -P_k^- H_k^T - P_k^- H_k^T + 2 K_k H_k P_k^- H_k^T + 2 K_k R.    (12)

This result follows from the matrix-gradient rules developed in the previous subsection. Setting this gradient to zero yields the result (10).

Now we finish the derivation by deriving equations to update our estimates of each matrix and vector that depends on the given time step. To be most clear, we extend our notation for the expected value to explicitly indicate the available information, so that E[x_k | I_{k-1}] indicates the expected value given the information I_{k-1}. First, the a priori estimate of the state is

x̂_k^- = E[x_k | I_{k-1}] = E[A x_{k-1} + w_{k-1} | I_{k-1}] = A x̂_{k-1}.

Also, we can obtain a relation for the a priori error covariance in a similar manner:

P_k^- = E[e_k^- (e_k^-)^T | I_{k-1}]
      = E[(x_k - x̂_k^-)² | I_{k-1}]
      = E[(A x_{k-1} + w_{k-1} - A x̂_{k-1})² | I_{k-1}]
      = E[(A e_{k-1} + w_{k-1})² | I_{k-1}]
      = A P_{k-1} A^T + Q.

In the last line we use the orthogonality of e_{k-1} and w_{k-1}.
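As a quick sanity check of this propagation rule, the following Monte Carlo sketch compares the sample covariance of A e_{k-1} + w_{k-1} with the closed form A P_{k-1} A^T + Q. The matrices A, P_{k-1}, and Q are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
P_prev = np.array([[0.5, 0.1],
                   [0.1, 0.2]])          # a posteriori covariance at step k-1 (assumed)
Q = np.diag([1e-2, 4e-2])                # process-noise covariance (assumed)

# Sample e_{k-1} ~ N(0, P_{k-1}) and w_{k-1} ~ N(0, Q), independently.
n_samples = 200_000
e_prev = rng.multivariate_normal(np.zeros(2), P_prev, size=n_samples)
w_prev = rng.multivariate_normal(np.zeros(2), Q, size=n_samples)

e_prior = e_prev @ A.T + w_prev          # a priori error e_k^- = A e_{k-1} + w_{k-1}
P_prior_mc = np.cov(e_prior.T)           # empirical covariance of e_k^-
P_prior = A @ P_prev @ A.T + Q           # the closed form just derived

print(np.round(P_prior_mc, 3))
print(np.round(P_prior, 3))              # the two agree to Monte Carlo accuracy
```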


Finally, we can simplify the equation that updates P_k by substituting the optimal Kalman gain K_k:

P_k = P_k^- - K_k H_k P_k^- - P_k^- H_k^T K_k^T + K_k (H_k P_k^- H_k^T + R) K_k^T
    = P_k^- - K_k H_k P_k^- - P_k^- H_k^T K_k^T + P_k^- H_k^T K_k^T
    = (I - K_k H_k) P_k^-,

where, in the second line, we have used the fact that K_k = P_k^- H_k^T (H_k P_k^- H_k^T + R)^{-1}, so that K_k (H_k P_k^- H_k^T + R) = P_k^- H_k^T. In the following section we collect all of these results and describe the way that they go together to make a computationally efficient solution to the recursive least-squares problem.
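The optimization results of this subsection can be checked numerically. The sketch below, using randomly generated P_k^-, H_k, and R purely for illustration, verifies that the gain (10) zeroes the gradient (12), that the simplified update (I - K_k H_k) P_k^- agrees with the full expression (11) at that gain, and that perturbing the gain never lowers tr(P_k).

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 2                                   # state / measurement dimensions (invented)

# Random symmetric positive-definite P_k^- and R, and a random H_k.
Mp = rng.standard_normal((n, n)); P_prior = Mp @ Mp.T + n * np.eye(n)
Mr = rng.standard_normal((m, m)); R = Mr @ Mr.T + m * np.eye(m)
H = rng.standard_normal((m, n))

S = H @ P_prior @ H.T + R                     # H_k P_k^- H_k^T + R
K = P_prior @ H.T @ np.linalg.inv(S)          # optimal gain, equation (10)

# Gradient (12) vanishes at the optimal gain.
grad = -2 * P_prior @ H.T + 2 * K @ S
print(np.allclose(grad, 0))

# Expanded covariance (11) versus the simplified update (I - K H) P^-.
P_full = P_prior - K @ H @ P_prior - P_prior @ H.T @ K.T + K @ S @ K.T
P_simple = (np.eye(n) - K @ H) @ P_prior
print(np.allclose(P_full, P_simple))

# A perturbed gain never does better than the optimum (tr P_k is minimized).
def trace_P(Kc):
    return np.trace(P_prior - Kc @ H @ P_prior - P_prior @ H.T @ Kc.T + Kc @ S @ Kc.T)

print(all(trace_P(K) <= trace_P(K + 0.1 * rng.standard_normal((n, m))) for _ in range(5)))
```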

3.3 The filter

The two covariance matrices, Q and R, which describe the random nature of the system, are inputs to the algorithm. We must also know the matrices A and H_k for every time step. Moreover, we must initialize both P_0 and x̂_0 in some way. With these inputs, the best estimate x̂_k is given in terms of the last estimate x̂_{k-1} and the new measurement z_k by the following series of equations:

K_k = P_k^- H_k^T (H_k P_k^- H_k^T + R)^{-1}
P_k = (I - K_k H_k) P_k^-
x̂_k = x̂_k^- + K_k (z_k - H_k x̂_k^-).    (13)

These equations are known as the update equations, which give the next estimate. Chronologically, the update equations follow the use of the predict equations, which give us our values of P_k^- and x̂_k^- in terms of the last estimate [1]:

x̂_k^- = A x̂_{k-1}
P_k^- = A P_{k-1} A^T + Q.    (14)
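Collecting the predict equations (14) and the update equations (13), a minimal Python implementation of the filter might look like the sketch below. The constant-velocity model, the noise levels, and the initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def kalman_filter(zs, A, H, Q, R, x0, P0):
    """Discrete linear Kalman filter: apply (14) then (13) for each measurement z_k."""
    n = A.shape[0]
    x_hat, P = x0, P0
    estimates = []
    for z in zs:
        # Predict (14): a priori estimate and error covariance.
        x_prior = A @ x_hat
        P_prior = A @ P @ A.T + Q
        # Update (13): gain, a posteriori covariance, and a posteriori estimate.
        S = H @ P_prior @ H.T + R
        K = P_prior @ H.T @ np.linalg.inv(S)
        P = (np.eye(n) - K @ H) @ P_prior
        x_hat = x_prior + K @ (z - H @ x_prior)
        estimates.append(x_hat)
    return np.array(estimates)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    # Illustrative constant-velocity model; all numbers are assumptions.
    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])
    Q = np.diag([1e-3, 1e-2])
    R = np.array([[0.25]])

    # Simulate a trajectory and noisy position measurements.
    x = np.array([0.0, 1.0])
    truth, zs = [], []
    for _ in range(100):
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
        zs.append(H @ x + rng.multivariate_normal(np.zeros(1), R))
        truth.append(x)
    truth = np.array(truth)

    est = kalman_filter(zs, A, H, Q, R, x0=np.zeros(2), P0=np.eye(2))
    rmse_meas = np.sqrt(np.mean((np.array(zs)[:, 0] - truth[:, 0]) ** 2))
    rmse_filt = np.sqrt(np.mean((est[:, 0] - truth[:, 0]) ** 2))
    print(f"position RMSE: raw measurements {rmse_meas:.3f}, filtered {rmse_filt:.3f}")
```

The covariance update used here is the simplified form (I - K_k H_k) P_k^- derived above; in practice a longer, explicitly symmetric form of the expanded expression (11) is sometimes preferred for numerical robustness, but either is consistent with the derivation.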

This procedure is summarized in Figure 1.

Figure 1: A diagram of the Kalman filter algorithm. The two sets of equations, predict and correct, are applied for each measurement, yielding a new best estimate each time. This figure was adapted from a figure by G. Welch and G. Bishop found in [1].

4 Discussion

The linear, discrete Kalman filter just derived can be used to estimate the state of a process for which information is constantly gathered, or for any problem where one wants to have a state estimate quickly, before all the information is processed. Examples of the former are navigation and weather prediction; in both of these cases we want ongoing (updated) solutions each time data is added. An example of an application where all of the data is immediately available, but not all of it is used right away, is the tracking of a particle in a spatial detector. In high-energy physics, computational efficiency is extremely important because of the high rate at which particles travel through the detector. The a priori state estimate given by the Kalman filter after only a few data points can be used to determine whether a detector hit is close enough to be part of the current track and whether it should be used to update the track (state) estimate. In this way the filter allows the two jobs of track recognition and track fitting to be solved simultaneously.

The work done in Section 3 is enough to allow the interested reader to see how other estimation problems with multi-dimensional measurements could be solved using the matrix gradient. There are also a number of other applications of the matrix gradient in applied math in general, such as in the finite element method (FEM) of computational ...

