## Abstract

A new framework is presented for understanding how a nonnormal probability density function (pdf) may affect a state estimate and how one might usefully exploit the nonnormal properties of the pdf when constructing a state estimate. A Bayesian framework is constructed that naturally leads to an expansion of the expected forecast error in a polynomial series consisting of powers of the innovation vector. This polynomial expansion in the innovation reveals a new view of the geometric nature of the state estimation problem. It is shown that this expansion in powers of the innovation provides a direct relationship between a nonnormal pdf describing the likely distribution of states and a normal pdf determined by powers of the forecast error. One implication of this perspective is that when state estimation is performed on a nonnormal pdf, state estimates based on the mean become nonlinear functions of the innovation. A direct relationship is shown between the degree to which the state estimate varies with the innovation and the moments of the distribution. These and other implications of this new view of ensemble state estimation in nonlinear systems are illustrated in simple scalar systems as well as on the Lorenz attractor.

## 1. Introduction

Numerical prediction through state estimation in the geophysical sciences strives for high resolution and longer cycling intervals and/or assimilation windows. High resolution, which presently refers to resolving the mesoscale, provides the ability to resolve those details of the circulation that are important to making a useful forecast, such as convection along frontal boundaries and the internal dynamics within a tropical cyclone. The cycling interval refers to the time between state estimates. Longer cycling intervals can provide computational benefits by requiring the reading and writing of the state vector (possibly many state vectors in the ensemble context), as well as the matrix algebra of the data assimilation algorithm, to be executed less often. The assimilation window refers to the window in time over which observations are allowed to affect a state estimate. Longer assimilation windows allow for the inclusion of a greater number of observations in each window. It has also been suggested that, to be consistent with the observed life cycle of baroclinic waves, an assimilation window as long as 5 days is necessary (Fisher et al. 2005; Lorenc and Payne 2007).

Designing state estimation systems for use with high resolution and relatively long cycling intervals and/or assimilation windows will require consideration of a nonnormal probability density function (pdf) describing the likely distribution of states. This is because the effects of nonlinearity appear most strongly as the scale of meteorological phenomena decreases and the time scale of interest increases (Zhang et al. 2003; Tan et al. 2004; Reynolds and Rosmond 2003). This implies that quantities such as the third moment of the distribution become important and that, through the nonlinear model, the second moment at forecast time becomes dependent on the moments higher than the second at the analysis time (Miller et al. 1994).

Previous work toward state estimation techniques for nonlinear modeling systems has been discussed by Kushner (1967), Jazwinski (1998, chapter 9), Anderson and Anderson (1999), and Julier and Uhlmann (1997, 2004). However, in the geophysical sciences, work toward explicitly accounting for the nonnormal aspects of the distribution of possible states within an ensemble data assimilation framework has emphasized particle filtering [see van Leeuwen (2009) for a comprehensive review]. In particle filtering one estimates the probability that a particular realization is the true state by comparing against observations. Given these probabilities (sometimes referred to as weights) one can do such things as make a state estimate based on the mean of the pdf or randomly sample from the pdf to generate an ensemble consistent with observational and prior uncertainties. While this method is capable of dealing with any arbitrary nonnormal pdf, it does so in a way that does not make transparent the role played by the underlying nonnormal aspects of the pdf. In addition, there appear to be significant obstacles to the application of particle methods to the high-dimensional systems in the geophysical sciences (Snyder et al. 2008). Last, the algorithmic structure is relatively different from the explicit linear algebra involving the second moment required of traditional Kalman (1960) filtering. This makes it difficult to understand from the particle filtering perspective how and why the moments higher than the second affect a state estimate.
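As a point of reference, the weighting step of particle filtering described above can be sketched in a few lines. This is a minimal scalar illustration, not the implementation of any of the cited works; the Gaussian observation likelihood, prior, and numerical values are our own assumptions:

```python
import numpy as np

def particle_weights(particles, y, obs_var):
    """Weight each particle (candidate state) by its Gaussian observation
    likelihood, then normalize so the weights sum to one."""
    log_w = -0.5 * (y - particles) ** 2 / obs_var
    w = np.exp(log_w - log_w.max())  # shift by the max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
prior_ensemble = rng.normal(0.0, 1.0, size=5000)  # prior draws ("particles")
w = particle_weights(prior_ensemble, y=0.5, obs_var=0.25)

# A state estimate based on the mean of the pdf is the weighted average.
posterior_mean = np.sum(w * prior_ensemble)
```

For this choice of a standard normal prior, observation 0.5, and observation error variance 0.25, the exact posterior mean is 0.5 · (1/1.25) = 0.4, which the weighted average approaches as the number of particles grows. Note that the weights themselves give no direct view of how the prior's higher moments shaped the estimate, which is the opacity discussed above.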

Another way to get nonnormal distributions is through nonlinearity in the observation operators or the nonnormality of the observation likelihood. In this case the prior distribution of possible states before observations are considered may be normal; however, the pdf after the observations are considered (posterior) may be nonnormal. Two examples of ensemble filters that are aimed at the situation where the observation operator is nonlinear or the observation likelihood is nonnormal (but the prior distribution is normal) are the work of Zupanski (2005) and Fletcher and Zupanski (2006). Both of these works are directly related to variational methods (e.g., Lorenc 1986; Navon et al. 1992; Courtier et al. 1994) through the statement of a cost function, which defines the posterior distribution and implies that the solution technique is a minimization algorithm. The use of a minimization technique is fundamentally different from that of Kalman filtering: the minimization of a cost function aims at finding the mode of the posterior distribution, while the Kalman filter attempts to find its mean. Consequently, it is important to note that the mean and the mode of the posterior distribution will typically be different because of the generally nonnormal nature of the posterior. Here we will present a theory for both nonnormal prior distributions and observation likelihoods that converges on the mean of the posterior for arbitrary posterior distributions and reduces to the traditional Kalman filter for normal posterior distributions.

The main goal of this work is the derivation of a framework for the basic understanding of the role of the moments higher than the second in ensemble data assimilation. The derivation we present will be such that 1) it is optimal for arbitrarily distributed errors, 2) it is transparent in its use of the moments of the pdf, and 3) it is algorithmically similar to the matrix operations of the traditional Kalman filter. To this end we rederive the state estimation problem starting from Bayes’ rule. We make the standard assumption that the truth is unknowable and, rather than attempting to find it, we focus our attention on estimating the mean of the conditional pdf given observations and a previous guess. By expanding the expected forecast error in a polynomial series in powers of the innovation (the difference between today’s observation and our guess of that observation) we explicitly show how the moments of the prior affect the computation of the mean of the posterior. We emphasize that we will make no assumptions on the structure of the prior distribution. This formulation of the problem reveals how the properties of this state estimate vary with the innovation and how this is related to the geometry of the mean of the posterior.

In section 2 we derive the theory for state estimation using polynomial expansions in the innovation. In section 3 we show how truncating the expansion yields linear and quadratic estimation equations. In section 4 we apply the theory of sections 2 and 3 to simple scalar systems with normal and nonnormal pdfs and within the Lorenz (1963) attractor. Section 5 closes the manuscript with a summary of the most important results, conclusions, and avenues of future research presently being investigated.

## 2. Theory

### a. Posterior distribution

We imagine the true state **x** to be an *N* vector that is drawn from a distribution whose pdf we label *p*(**x**). In addition, we will collect the sum total of all previous information about this true state in a previous estimate we label **x**_*f*. At the present time we have a *p* vector of observations **y** such that we may use Bayes’ rule to obtain a density that describes the combined knowledge of the likely distribution of states:

$$p(\mathbf{x} \mid \mathbf{y}, \mathbf{x}_f) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x} \mid \mathbf{x}_f)}{p(\mathbf{y} \mid \mathbf{x}_f)}. \tag{2.1}$$

The density *p*(**y** | **x**) describes the conditional distribution of observations given a particular value of the truth (observation likelihood), while *p*(**x** | **x**_*f*) describes the conditional distribution of truth given a previous estimate of the true state; this last density will hereafter be referred to as the “prior.” The density *p*(**x** | **y**, **x**_*f*) describes the conditional distribution of the truth given a particular observation and previous estimate; this density will hereafter be referred to as the “posterior.”

### b. Posterior mean

In this section we will make no assumptions on the structure of the prior. The derivation below will assume that the observation likelihood is symmetric but not necessarily normal. This assumption that the observation likelihood is symmetric is made only for ease and clarity of presentation; the derivation can easily be completed for arbitrary observation likelihoods. Similarly, we will assume throughout that the observation operator is linear, though this assumption is also easily relaxed and does not change the essential results [e.g., see Houtekamer and Mitchell (2001) for a discussion of linear versus nonlinear observation operators in an ensemble framework].

A standard estimation technique for the true state given the posterior density is to find its mean:

$$\mathbf{x}_a = \int \mathbf{x}\, p(\mathbf{x} \mid \mathbf{y}, \mathbf{x}_f)\, d\mathbf{x}. \tag{2.2}$$

This estimate of the true state has the property that it is unbiased and that it minimizes the posterior error variance (Jazwinski 1998). In ensemble-based estimation techniques it is often assumed that our previous estimate of the truth is the mean of the prior distribution, **x**_*f* = ⟨**x**⟩. Note that a random draw from the prior distribution behaves as **x** = **x**_*f* + **ε**_*f*, where **ε**_*f* is a random variable with zero mean. This change of variable allows (2.2) to be written as

$$\mathbf{x}_a = \mathbf{x}_f + \int \boldsymbol{\varepsilon}_f\, p(\boldsymbol{\varepsilon}_f \mid \mathbf{y}, \mathbf{x}_f)\, d\boldsymbol{\varepsilon}_f. \tag{2.3}$$

Equation (2.3) shows that the “correction” to the mean of the prior that produces the mean of the posterior is the expected error given the distribution of errors conditioned on today’s observation and prior mean. A point that we will return to throughout this work is that the posterior mean will, in general, be a nonlinear function of today’s observation and prior mean whenever the prior distribution is nonnormal.

Without loss of generality we may consider the right-hand side of (2.3) as simply an unknown function of today’s observation and prior mean. By taking this perspective, (2.3) may be written concisely as

$$\mathbf{x}_a = \mathbf{x}_f + \mathbf{f}(\mathbf{y}, \mathbf{x}_f). \tag{2.4}$$

The vector function **f** is assumed smooth and is the object of central interest. One way to understand the structure of **f** is through an expansion in terms of the observation about the prior estimate of that observation (Jazwinski 1998, 340–346):

$$\mathbf{f} = \mathbf{f}_0 + \mathbf{F}_1 \boldsymbol{\upsilon} + \mathbf{F}_2 [\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}] + \mathbf{F}_3 [\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}] + \cdots, \tag{2.5}$$

where the innovation $\boldsymbol{\upsilon} = \mathbf{y} - \mathbf{H}\mathbf{x}_f$; the matrix **H** is the observation operator that takes the *N*-dimensional state vector into the *p*-dimensional observation space; and the vector **f**_0 and the matrix coefficients **F**_1, **F**_2, … of the expansion are to be determined below.

The unusual vector notation $[\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}]$ represents a vector of length *p*^{2} such that

$$\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon} = [\upsilon_1 \boldsymbol{\upsilon}^{\mathrm{T}},\ \upsilon_2 \boldsymbol{\upsilon}^{\mathrm{T}},\ \ldots,\ \upsilon_p \boldsymbol{\upsilon}^{\mathrm{T}}]^{\mathrm{T}}, \tag{2.6}$$

where the symbol “⊗” refers to the Kronecker product and *υ _{i}* is the *i*th element of the innovation vector. A similar representation applies to the *p*^{3} vector $[\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}]$ and so on.
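This stacked-vector notation can be checked directly with NumPy’s Kronecker product; the innovation values below are arbitrary illustrations:

```python
import numpy as np

v = np.array([2.0, -1.0, 3.0])      # a p-vector innovation (p = 3)
v2 = np.kron(v, v)                  # p^2 vector of all products v_i * v_j
v3 = np.kron(v, np.kron(v, v))      # p^3 vector for the cubic term

# The p^2 vector stacks v_1*v, v_2*v, ..., v_p*v end to end.
first_block = v2[:3]                # equals v_1 * v = [4, -2, 6]
```

The quadratic predictor thus has length *p*² and the cubic predictor length *p*³, which is the source of the rapid growth in the number of predictors as higher powers of the innovation are retained.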

The entire polynomial expansion in (2.5) may formally be represented as

$$\mathbf{x}_a = \mathbf{x}_f + \mathbf{f}_0 + \mathbf{G}\mathbf{z}, \tag{2.7}$$

where

$$\mathbf{G} = [\mathbf{F}_1\ \ \mathbf{F}_2\ \ \mathbf{F}_3\ \cdots], \qquad \mathbf{z} = [\boldsymbol{\upsilon}^{\mathrm{T}},\ [\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}]^{\mathrm{T}},\ [\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}]^{\mathrm{T}},\ \ldots]^{\mathrm{T}}. \tag{2.8}$$

Equation (2.7) describes how to calculate the mean of the posterior using the “linear” update involving the gain matrix **G** operating on the predictor vector **z**, where **z** comprises an infinite number of predictors formed from today’s innovation. What is interesting about this form of the state estimation problem is that it is now homomorphic to the traditional Kalman (1960) update equation for the mean of the posterior distribution under the assumption of a normal prior. We show in appendix A that this structure is fundamental to the process of predicting powers of the error, as this procedure is entirely consistent with a normal distribution whose mean (and mode) is the sought-after posterior mean of the nonnormal posterior in question. Last, we note that the formulation of the problem in (2.7) is very similar to nonlinear least squares estimation techniques [see Golub and Meurant (2010) for a review of least squares estimation techniques].

### c. Determining the coefficients of the expansion

To determine the vector **f**_0 we note that a random draw from the posterior distribution is

$$\mathbf{x} = \mathbf{x}_a + \boldsymbol{\varepsilon}, \tag{2.9}$$

where **ε** is a random variable with zero mean. Therefore, an equation for the “error” in estimating the true state as the posterior mean may be obtained by subtracting **x** from both sides of (2.7):

$$\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}_f - \mathbf{f}_0 - \mathbf{G}\mathbf{z}. \tag{2.10}$$

Because the expected value of **ε** and **ε**_*f* must vanish, this implies that

$$\mathbf{f}_0 = -\mathbf{G}\langle\mathbf{z}\rangle, \tag{2.11}$$

where

$$\langle\mathbf{z}\rangle = [\langle\boldsymbol{\upsilon}\rangle^{\mathrm{T}},\ \langle\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}\rangle^{\mathrm{T}},\ \langle\boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon} \otimes \boldsymbol{\upsilon}\rangle^{\mathrm{T}},\ \ldots]^{\mathrm{T}} \tag{2.12}$$

and the notation 〈 〉 represents the expected value of a random variable. Note that if we assume that ⟨**υ**⟩ = **0**, this implies that we have assumed that the observation and the mean of the prior are accurate insofar as the distribution of truth about them is unbiased. If this were not the case we could simply include the bias in ⟨**z**⟩. In any event, (2.11) shows that when the innovation vanishes (i.e., an observation happens to equal the estimate of that observation by the prior mean) then the difference between the prior and posterior mean is equal to the gain operated upon the expected value of the predictor vector. Given (2.11) we may now rewrite (2.10) as

$$\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}_f - \mathbf{G}\tilde{\mathbf{z}}, \tag{2.13}$$

where $\tilde{\mathbf{z}} = \mathbf{z} - \langle\mathbf{z}\rangle$, which now clearly has the property that the errors have zero mean.

The standard technique to derive the minimum error variance estimate is to find the gain matrix in (2.13) that minimizes the trace of the *expected* posterior error covariance matrix:

where is the posterior error covariance matrix [see appendix B and (B.10)] and is the pdf that describes the distribution of innovations. Upon making use of (2.13) and (2.14) the *expected* posterior error covariance matrix may be calculated as

where

with **0**_{1} the *p* × *p* zero matrix and **0**_{2} the *p* × *p*^{2} zero matrix. The matrices in (2.16a)–(2.16d) are defined as

and is the matrix operator that takes an *N*^{2} vector into the *p*^{2} predictor space, and copious use of the identity has been made. Note that while using the previous identity leads to a much clearer and deeper presentation, for computational reasons one would likely leave these expressions in terms of the left-hand side of the identity during the evaluation of these terms, because the length of the state vector in observation space is likely to be much shorter than that of the state vector in state space. Similarly, the use of this identity would not be warranted for nonlinear observation operators. In the case of nonlinear observation operators one would simply calculate the covariance between the ensemble members, after being operated on by the nonlinear observation operator, and the variables of interest, thereby avoiding recourse to these matrices altogether.

In (2.17a), is a square matrix listing the necessary second moments of the prior distribution. In (2.17b), is a rectangular matrix listing the necessary third moments of the prior distribution. In (2.17c), is a square matrix listing the necessary fourth moments of the prior distribution. In (2.17d), is a square matrix listing the necessary fourth moments of the observation likelihood. Note that even when the observation error covariance matrix is diagonal is not diagonal. The matrices are square matrices that represent various combinations of observation error covariances and forecast error covariances.
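In an ensemble implementation the moment matrices above would be estimated from sample statistics of the prior ensemble in observation space. The following is a minimal sketch of second- and third-moment estimation under our own assumptions (a two-variable toy ensemble with one deliberately skewed component; the variable names are not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_ens = 2, 10000
# Toy prior ensemble in observation space: one skewed (chi-square) variable
# and one symmetric (normal) variable.
ens = np.stack([rng.chisquare(1, n_ens) - 1.0,
                rng.normal(0.0, 1.0, n_ens)])

pert = ens - ens.mean(axis=1, keepdims=True)   # ensemble perturbations

# Second moment: p x p sample covariance.
second = pert @ pert.T / (n_ens - 1)

# Third moment: p x p^2 matrix of sample moments <e_i e_j e_k>,
# with columns ordered as the stacked Kronecker square (index j*p + k).
sq = np.einsum('jn,kn->jkn', pert, pert).reshape(p * p, n_ens)
third = pert @ sq.T / (n_ens - 1)
```

For the skewed variable the sample third moment is large (the centered chi-square with one degree of freedom has third central moment 8), while for the symmetric variable it is near zero, which is exactly the contrast that distinguishes the quadratic filter from the linear one in section 3.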

By minimizing the trace of the *expected* posterior error covariance matrix (2.15) with respect to the matrix , one may determine an expression for that minimizes the expected error variance:

It is important to remember, however, that in general the posterior error covariance varies with the innovation and, as shown in appendix B, the error covariance as a function of innovation must satisfy

whenever the observation likelihood is normal. Equation (2.20) reveals the relationship between the slope of the posterior mean as a function of the innovation and the posterior error variance of the observed variables. This geometric interpretation of the error covariance as a function of the innovation may be extended to moments higher than the second by taking further derivatives.

## 3. Truncation of the polynomial expansion

In practice it is impossible to calculate all of the moments required in the matrices above. To make progress one must truncate the polynomial series in (2.5) and derive an approximation to (2.18) and (2.19). The resulting estimation equations illustrate the relationship between traditional estimation equations that are linear in the innovation (e.g., Kalman 1960) and those that make use of higher powers of the innovation.

### a. Linear filter

By truncating the polynomial expansion in (2.5) at the linear term and minimizing the trace of the expected error covariance matrix we find

$$\mathbf{K} = \mathbf{P}_f \mathbf{H}^{\mathrm{T}} (\mathbf{H} \mathbf{P}_f \mathbf{H}^{\mathrm{T}} + \mathbf{R})^{-1}, \tag{3.1}$$

where **P**_*f* is the prior error covariance matrix and **R** is the observation error covariance matrix, and where it should be recognized that we have derived the Kalman (1960) and Kalman and Bucy (1961) filter, which in geophysics is commonly implemented using ensemble methods and referred to as the EnKF:

$$\mathbf{x}_a = \mathbf{x}_f + \mathbf{K}\boldsymbol{\upsilon}, \tag{3.2a}$$

where the expected posterior error covariance is

$$\langle\mathbf{P}_a\rangle = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}_f. \tag{3.2b}$$

Note that the linear filter is incapable of estimating the constant term **f**_{0} of the expansion and therefore the estimate of the mean of the posterior for vanishingly small innovations is simply that of the mean of the prior. As we will show, this is a good assumption for symmetric prior distributions, but not for strongly skewed prior distributions.
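A minimal sketch of this linear (Kalman/EnKF-type) update, with the prior covariance estimated from an ensemble, may make the matrix operations concrete. The dimensions, observation operator, and values are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, n_ens = 3, 2, 500

# Toy prior ensemble, linear observation operator, and observation error covariance.
ens = rng.normal(0.0, 1.0, size=(N, n_ens))
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
R = 0.5 * np.eye(p)
y = np.array([0.8, -0.3])

x_f = ens.mean(axis=1)
pert = ens - x_f[:, None]
P_f = pert @ pert.T / (n_ens - 1)    # sample prior error covariance

# Kalman gain and the update, linear in the innovation.
v = y - H @ x_f                       # innovation
K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)
x_a = x_f + K @ v

# Expected posterior error covariance of the linear filter.
P_a = (np.eye(N) - K @ H) @ P_f
```

Note that the update reduces the trace of the error covariance regardless of the particular innovation drawn, which is precisely the “expected” character of the linear filter discussed above.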

It will prove useful below to note that, given a square root **X**_*f* of the error covariance matrix of the prior (i.e., $\mathbf{P}_f = \mathbf{X}_f \mathbf{X}_f^{\mathrm{T}}$), the linear update equation for the estimate of the posterior mean may be written as

$$\mathbf{x}_a = \mathbf{x}_f + \mathbf{X}_f \mathbf{w}, \tag{3.3}$$

where the weight vector **w** for the members of the square root of **P**_*f* is

$$\mathbf{w} = \mathbf{X}_f^{\mathrm{T}} \mathbf{H}^{\mathrm{T}} (\mathbf{H} \mathbf{P}_f \mathbf{H}^{\mathrm{T}} + \mathbf{R})^{-1} \boldsymbol{\upsilon}. \tag{3.4}$$

As shown in appendix B there is a direct relationship between the slope of the posterior mean with respect to the innovation and the error covariance of the posterior. For a linear filter this relationship simply recovers

$$\mathbf{K} = \langle\mathbf{P}_a\rangle \mathbf{H}^{\mathrm{T}} \mathbf{R}^{-1}. \tag{3.5}$$

The relationship in (3.5) has recovered an alternate form of the Kalman gain. Note that the analysis error covariance in (3.5) is simply that of the expected analysis error covariance in (3.2b). Similarly, there does not exist any curvature in the estimate of the posterior mean in the linear filter and therefore no estimate of the posterior third moment can be made. In the filter to be presented next we will see that including the next-order correction allows for the estimation of the posterior second and third moments.

### b. Quadratic filter

By truncating the polynomial at the quadratic term and minimizing the trace of the expected error covariance matrix we find

where

In (3.6a)–(3.6c) we can see that the inclusion of the quadratic term has not just added a new term but has also corrected the previous terms. Hence, one may think of this procedure of adding additional terms and finding a gain matrix that minimizes the expected error covariance as developing an “expansion” for each of the terms in the exact gain in (2.18). The linear filter of the previous section consists of the lowest-order terms of the expansion of the coefficients in (2.5). It is important to recognize, however, that for vanishingly small innovations and a prior distribution with a significant third moment, the leading term of the complete expansion in (2.5) is **f**_{0}. Hence, in the limit of vanishingly small innovation the linear filter (3.2a) fails to include the largest term of the complete expansion (2.5), while the quadratic filter makes an *estimate* of **f**_{0}.

Note that (3.7a) and (3.7b) are simply the linear filter of section 3a with a correction term that is proportional to the third moment of the prior distribution. Hence, if the prior distribution is perfectly symmetric the quadratic filter reduces to the linear filter. In this case it would prove more effective to truncate the polynomial expansion at the cubic term in order to retain a correction to the linear filter. In addition, note that the expected error variance (3.7b) is that of the linear filter (3.2b) with a new term that, for all nonzero third moments, reduces the trace of the analysis error covariance to less than that of the linear filter.

The term in square brackets on the right-hand side of (3.7a) is the “innovation” in squared predictor space and may be written as

where the matrix in question has columns that are the “squares” of the ensemble perturbations (i.e., its *i*th column is the Kronecker square of **ɛ**_*fi*, the *i*th prior ensemble perturbation). Equation (3.8) is the difference between the perturbation of the squared innovation around its mean and the prediction of that perturbation squared innovation by a linear combination of the squared columns of the square root of the prior error covariance weighted by the linear estimate. Hence, if the linear weights are sufficient to perfectly predict the squared innovation then the quadratic filter again reduces to the minimum variance estimate of a linear estimator. To the degree that the linear weights do not predict the squared innovation vector, the state estimate is modified away from that of the linear estimate.
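The “squared” ensemble perturbation matrix described above can be formed column by column with the Kronecker product. The ensemble below is an arbitrary illustration of the construction, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n_ens = 2, 6
# Ensemble perturbations in observation space (columns are members).
pert_obs = rng.normal(0.0, 1.0, size=(p, n_ens))

# The i-th column is the Kronecker "square" of the i-th perturbation,
# i.e., all pairwise products of its elements stacked into a p^2 vector.
sq = np.column_stack([np.kron(e, e) for e in pert_obs.T])
```

Rows of `sq` corresponding to diagonal products (e.g., row 0 holds the first element squared for every member) are nonnegative by construction, which is one visible signature of the squared predictor space.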

The quadratic filter may be written as weights of the prior ensemble:

where

Equation (3.9) clearly shows that the quadratic filter is simply a correction to the linear weights that takes account of moments higher than the second. In fact, this also implies that the entire polynomial expansion in (2.5) may equally well be interpreted as an expansion of the weights in (3.9). This interpretation of the expansion in (2.5) is interesting because the expansion is then entirely represented in the potentially much smaller observation/ensemble subspace. Hence, the computational advantages of serial filtering (Houtekamer and Mitchell 2001; Bishop et al. 2001), observation volumes (Ott et al. 2004), or variational methods (Buehner 2005) can be realized within the polynomial expansion approach. Work in this direction will be reported in a sequel.

For the quadratic filter the slope of the posterior mean with respect to the innovation allows for the estimation of the posterior error covariance matrix as a “linear” function of the innovation. Taking the derivative of (3.7a) with respect to the innovation obtains

where

and **I** is the *p* × *p* identity matrix. By comparing (3.11) to (3.5) one can see that the lowest-order variation in the analysis error covariance arising from drawing a particular innovation is due entirely to the third moment of the prior. Equation (3.11) differs from (3.5) even when today’s innovation happens to be equal to zero. This is because the slope of the linear term is altered by the third moment of the prior in the quadratic filtering framework.

We may find the leading order estimate of the third moment of the posterior by taking another derivative of (3.11) and using the notation of appendix B:

where

and the *p* × *p* submatrices within (3.14) are realizations of the *elementary matrix*, defined with elements *e _{mn}* having the property that *e _{mn}* = 1 when *m* = *i* and *n* = *k* and all other elements are zero. Note that in the first matrix on the right-hand side of (3.14) the second index advances first and all possible elementary matrices are realized. In the second matrix on the right-hand side of (3.14) the *p* × *p* identity matrix appears at each location where *i* = *k*. Note that the matrix is block-diagonal with the matrix coefficient of the quadratic term of the expansion along the diagonal. Because we began from a quadratic estimator, (3.13) is independent of the innovation even though the true third moment of the posterior is likely to vary with the innovation. To obtain a linear estimate of the third moment of the posterior we would need to extend the quadratic filter to the cubic term in (2.5).

## 4. Illustrative examples

### a. Scalar systems

For concreteness we will focus on two different priors: 1) a normal distribution and 2) a chi-square distribution with one degree of freedom. For simplicity the observation error will be drawn from a normal distribution. The matrices in the linear and quadratic filters of section 3 all become scalars in this case and will be evaluated analytically from known formulas.

These two priors have been carefully chosen to illustrate what we believe are typical characteristics of pdf’s in linear and nonlinear systems. The normally distributed prior has been chosen because it is the canonical model for linear systems. The chi-square prior has been chosen as a model for a nonnormal pdf, such as what would be found in a nonlinear, forced-dissipative system [such as Lorenz (1963); see section 4b]. This is because this pdf has the property that some states are forbidden (zero probability of occurrence), consistent with an underlying attracting manifold, and consequently that the mean of the posterior is “curved” as a function of the innovation. As we shall see, this leads to the posterior error variance being a nonlinear function of the innovation. Again, we emphasize that the curvature in the posterior mean is an important distinguishing characteristic between linear and nonlinear systems.

Figure 1 shows the posterior distribution for the two priors mentioned above as represented through a scatter diagram of random draws from the true distribution as a function of the innovation. One distinguishing characteristic of a prior that is normal is that the mean of its posterior is a linear function of the innovation (Fig. 1a). Another distinguishing characteristic of a prior that is normal is that the posterior error variance is constant (Fig. 1b) and hence equal to its expected posterior error variance. In contrast, the mean of the posterior distribution is curved as a function of the innovation (Fig. 1c) for the case with the chi-square prior. Note that the distribution of truth values scattered around this mean is nonuniform, as shown by the error variance plotted in Fig. 1d, and therefore not generally equal to its expected posterior error variance. Similarly, shown in Fig. 1d is the third moment of the posterior in Fig. 1c. One can see from the curves of the second and third moments that the relationship described in appendix B is satisfied. One may see this by simply noting that the curvature in the posterior mean as a function of the innovation is most acute where the third moment is large and hence there is a direct relationship between the third moment and the second derivative of the posterior mean. This is consistent with the polynomial expansion and the quadratic filter in that it implies that the first step in getting the curvature correct requires consideration of the prior third moment.
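The curvature of the posterior mean for a chi-square prior can be reproduced with a short Monte Carlo experiment: binning random draws of the truth by the innovation approximates the conditional mean, in the spirit of the scatter diagrams just described. The sample size, unit observation error variance, and bin edges are our own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

x_true = rng.chisquare(1, n)              # draws from the chi-square prior
y = x_true + rng.normal(0.0, 1.0, n)      # observations with unit error variance
v = y - x_true.mean()                     # innovation about the prior mean

# Approximate the conditional (posterior) mean of x given the innovation
# by averaging the truth within innovation bins.
bins = np.linspace(-2.0, 4.0, 13)
idx = np.digitize(v, bins)
cond_mean = np.array([x_true[idx == i].mean() for i in range(1, len(bins))])

# For a normal prior these increments would all be equal (a straight line);
# for the chi-square prior they grow with the innovation (curvature).
increments = np.diff(cond_mean)
```

The increments are small for negative innovations (the prior forbids x < 0, so the posterior mean flattens) and approach the bin width for large positive innovations, which is the curvature visible in Fig. 1c.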

#### 1) Estimating the posterior mean

Application of the quadratic filter in (3.7a) to the normal prior in Figs. 1a,b results in a state estimate that is identical to that of the linear filter in (3.1a). This is because the third moment of the prior vanishes in this case, which reduces the quadratic filtering equations to exactly those of the linear filter. Because the posterior mean is in fact a linear function of the innovation for a normal prior, the linear filter equations are capable of calculating exactly the true mean and error variance of the posterior.

In contrast, application of the state estimation techniques of section 3 to the chi-square prior in Figs. 1c,d obtains quite different estimates of the mean and error variance of the posterior (Fig. 2) for a linear versus quadratic estimate. Note that both the linear and quadratic estimation techniques provide a reliable (albeit different) estimate of the mean of the posterior for small innovation. The largest errors are obtained for large innovation, consistent with what one would expect from the truncation of a polynomial expansion. In other words, the effect of including additional terms in the expansion is to keep the estimate reliable over a larger range of innovations. An obvious way to define the range of innovations for which an estimation technique is reliable is to require its posterior error variance to remain smaller than its prior error variance. In Fig. 2b it can be seen that truncating the polynomial expansion at the linear term is a reliable method over a range of approximately **v** equal to −3 to 5, while truncating the polynomial expansion at the quadratic term is a reliable method over the much larger range of approximately **v** equal to −5 to 9. This behavior, in which the linear estimator is accurate only over a limited range of innovations, is clearly one reason why operational forecasting centers find it useful to discard large innovations. Note, however, that large negative innovations in the system defined by the chi-square prior actually result in the mean of the posterior having smaller error variance than when compared to the error variance for **v** = **0**. This shows that a technique that can effectively make use of relatively large innovations is potentially much more accurate on those days when large innovations are realized.

A linear estimator implicitly sets **f**_{0} = 0 and therefore its estimate at **v** = 0 is simply that of the prior mean. In contrast, because the quadratic filter is capable of obtaining a nonzero estimate of **f**_{0}, it obtains a smaller error variance even when **v** = 0. The posterior error variance when **v** = 0 is 0.45 for the quadratic filter while the linear filter obtains 0.57. Similarly, the linear filter finds an expected posterior error variance of 0.67 and for the quadratic filter the expected posterior error variance is found to be 0.51. This shows that when weighted toward small but likely innovations, the linear estimator reduces the error variance of the prior by approximately 83%, while the addition of the quadratic term reduces the error variance by another 10%. Because the first two terms of the polynomial series account for 93% of the expected posterior error variance it proves reasonable to truncate at the quadratic term.

#### 2) Sampling from the posterior

So far our focus has been upon estimating the posterior mean. An equally important part of employing ensemble methods is the generation of the ensemble such that it correctly samples the posterior distribution. Here we illustrate the application of the framework described in the previous section to distributions whose posterior error variance is a function of the innovation. Our focus will be on the chi-square prior because, as shown in Figs. 1a,b, the error variance of the normal distribution is not a function of the innovation and therefore application of traditional techniques in the linear and quadratic filters precisely recovers the true posterior distribution. It is the nonnormal aspects of the prior that lead to issues with the generation of a random sample consistent with the posterior distribution.

We have chosen to compare the linear and quadratic filtering frameworks using “perturbed observations” because of its popularity in ensemble data assimilation in the geophysical sciences and also because it is much simpler to implement in the quadratic filtering framework than the deterministic or square root formulations [see Tippett et al. (2003) for a comprehensive review of square root filter algorithms for linear filters]. A standard method for the generation of an ensemble that approximates random draws from the posterior distribution is to write (2.13) in the linear filtering framework:

This method is referred to as perturbed observations (Houtekamer and Mitchell 1998; Burgers et al. 1998; Evensen 1994) and forms the ensemble by drawing an observation error and a forecast error from the prior ensemble, evaluating (4.1), and adding the result to the posterior mean calculated from (3.2a). This method has the property that for all innovations **v** one obtains an ensemble whose error variance is equal to the expected posterior error variance in (3.2b).
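A minimal sketch of the perturbed-observations step in the scalar linear setting may help fix ideas. The prior and observation error variances, the scalar gain (taking the observation operator to be unity), and the ensemble size are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n_ens = 2000
var_f, var_o = 2.0, 1.0                 # prior and observation error variances
K = var_f / (var_f + var_o)             # scalar Kalman gain (H = 1)

eps_f = rng.normal(0.0, np.sqrt(var_f), n_ens)   # draws of forecast error
eps_o = rng.normal(0.0, np.sqrt(var_o), n_ens)   # draws of observation error

# Perturbed observations: each member's error is updated with its own
# perturbed innovation, so that the analysis spread matches the expected
# posterior error variance (1 - K) * var_f, independent of the innovation.
eps_a = eps_f - K * (eps_f + eps_o)
```

The analysis perturbations have variance (1 − K)²·var_f + K²·var_o = (1 − K)·var_f = 2/3 here, the same for every innovation, which is exactly the property (and, for nonnormal priors, the limitation) discussed in the surrounding text.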

Note that because we have assumed that the observation is a measurement of one member of the prior , using this expression in the innovation leads to

This definition of the innovation adds the observation error to the forecast error, which is not typical. However, because the members of the ensemble are random draws from the prior , it is self-consistent to include a minus sign in front of the gain as shown in (4.1). Because the distribution is symmetric this results in precisely the same ensemble distribution from (4.1) as traditional perturbed observation algorithms. The point here is that in *linear* filter algorithms the sign of the innovation within the perturbed observation algorithm can self-consistently be represented either way. However, in the quadratic filtering framework careful consideration of the structure of the innovation must be made to maintain the correct signed relationship between the third moment of the innovation and the third moment used in the quadratic gain term. These considerations show that the self-consistent generation of an ensemble in the quadratic filtering framework can be accomplished as

where we again draw an observation error from and a forecast error from the prior ensemble, use (4.2) to evaluate (4.3), and add the result to the calculated from (3.7a). As in the linear filter's perturbed observations framework, this method has the property that for all innovations **v** one obtains an ensemble whose error variance is equal to the expected posterior error variance (3.7b).

Figures 3a,b show the perturbed observations ensemble for the linear and quadratic estimation techniques. Recall that these figures are intended to be representations of the posterior distribution in Fig. 1c. Note that both methods produce far too many ensemble members that sample the region of zero probability of **x** < 0. Note also that because both methods create an ensemble whose error variance is that of the *expected* posterior error variance, and not the posterior error variance as a function of the innovation, the dispersion of the members around their mean appears quite uniform as a function of innovation. Last, the true posterior distribution in Fig. 1c has a strong third moment for small innovation while the ensemble distributions in Fig. 3 both have a very small third moment as can be seen by the symmetry of the distribution of the ensemble members around their mean. This excessive symmetry or smoothing of the shape of the pdf through ensemble generation by perturbed observation algorithms has been discussed previously by Lawson and Hansen (2004). This excessive symmetry of the ensemble’s estimate of the posterior is particularly unsatisfactory when we consider that the biggest error in our ensemble’s estimate of the third moment occurs where the innovation is small and hence where we might expect an expansion in the innovation to perform well. These issues arise not because of the framework of the expansion but because we required the ensemble to satisfy its *expected* error variance (and expected higher moments) and not its error variance as a function of innovation.

This symmetry of the ensemble’s estimate of the posterior distribution is a serious issue for obtaining an optimal filtering system when nonlinearity is such that the posterior mean contains substantial curvature. As shown by Miller et al. (1994), the moments of the posterior distribution higher than the second, after being propagated by the nonlinear model, will affect the second moment of the subsequent prior. This will then lead to an inaccurate prior that will adversely affect the performance of the linear and quadratic filtering equations in the next data assimilation cycle.

Obviously it would be far better for the posterior ensemble distribution to match its true error variance in Fig. 2b and to preserve its higher moments. At present we have not been able to derive a set of equations that produces an ensemble that delivers a match to the true posterior error variance and also its higher moments as a function of innovation. We have however already presented a technique for determining an estimate of the true error variance at observation locations (appendix B). In Fig. 4 we compare the true error variance as a function of the innovation to that estimated by the derivative technique of appendix B. Similarly, prior inflation techniques, such as Anderson (2007), though generally discussed in the context of sampling error, also compensate for the fact that the error variance of the state estimate as a function of the innovation (Fig. 2b) should typically be larger than the expected error variance and does so by using information derived from observations.

### b. Lorenz attractor

In this section we will apply the previously developed theory to the update of the posterior mean and distribution of the ensemble. The main goal here is to understand the behavior when observing only a portion of the state vector and updating unobserved variables.

We use the simple and well-known Lorenz (1963) equations:
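The display equation was lost in extraction; in its standard form (the classical parameter values shown are assumed here, as the paper's values were also lost), the Lorenz (1963) system reads:

```latex
\begin{aligned}
\dot{x} &= \sigma\,(y - x),\\
\dot{y} &= x\,(\rho - z) - y,\\
\dot{z} &= x\,y - \beta\,z,
\end{aligned}
\qquad \sigma = 10,\quad \rho = 28,\quad \beta = 8/3 .
```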

where the parameters will be . The system (4.3a)–(4.3c) will be solved using the explicit Runge–Kutta (4,5) solver (ode45) found in recent versions of MATLAB.

We perform an integration starting from a point (initial condition) off the attractor and integrate long enough (ten units of time) to be assured that we have arrived on the attractor. We take this new state and perturb it 100 000 times with normally distributed random noise with mean 0 and variance 0.01, and subsequently integrate each perturbed state for 1 unit of time. We chose one unit of time because it is sufficiently long that the ensemble, with such small initial variance, now lies entirely on the attractor, but not so long that the members spread unrealistically far from each other in the context of ensemble data assimilation. Next, we assume that this ensemble is the true prior distribution and randomly observe the variable *z* of members of this prior with observation error variance of 0.1. This procedure determines the posterior distribution as a function of the innovation, which is plotted in Fig. 5. Figure 5c shows the posterior distribution for the observed variable *z*. Note that the posterior of the observed variable is the most nearly (though not exactly) linear as a function of the innovation. This property is easily explained by writing the formula for the innovation using the definition of the observation , which clearly shows that the innovation is generally linear in the truth. However, this property does not typically hold for sufficiently large innovation, as can be seen in Fig. 5c and even in Fig. 1c. This linearization for small innovation is a general property of the posterior for observed variables and hence is a good reason to observe every point in the state vector if possible.

In the unobserved variables (Figs. 5a,b) the posterior distribution is far more curved as a function of the innovation, and we see the previously described issues of curvature in the mean of the posterior as well as the dependence of the error variance about that mean on the innovation.
Note, however, that the relationship between the strong curvature in the mean in, for example, Fig. 5a and the third moment is now between the state variable *x* and the “square” of the observed variable [i.e., the third moment in question is ].
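The experiment above can be sketched compactly. The code below is a scaled-down analogue, not the paper's exact setup: it uses far fewer members than 100 000 for speed, and it assumes the classical Lorenz parameter values, since the paper's were lost in extraction. SciPy's RK45 stands in for MATLAB's ode45.

```python
import numpy as np
from scipy.integrate import solve_ivp

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0   # classical values (assumed)

def lorenz(t, s):
    x, y, z = s
    return [SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z]

def integrate(state, t):
    # RK45 is the SciPy counterpart of MATLAB's ode45.
    sol = solve_ivp(lorenz, (0.0, t), state, method="RK45",
                    rtol=1e-6, atol=1e-9)
    return sol.y[:, -1]

rng = np.random.default_rng(2)

# Spin up onto the attractor from an arbitrary off-attractor point.
base = integrate(np.array([1.0, 1.0, 1.0]), 10.0)

# Perturb with noise of variance 0.01 (std 0.1) and evolve each member
# one time unit to form the prior ensemble (500 members here, not 100 000).
n_ens = 500
prior = np.array([integrate(base + rng.normal(0.0, 0.1, 3), 1.0)
                  for _ in range(n_ens)])

# Observe z of one randomly chosen member with observation-error variance 0.1.
truth = prior[rng.integers(n_ens)]
obs = truth[2] + rng.normal(0.0, np.sqrt(0.1))
innovation = obs - prior[:, 2].mean()
print(prior.shape, innovation)
```

Repeating the observation step over many members and binning by innovation is what builds up the conditional posterior plotted in Fig. 5.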

#### 1) Estimating the posterior mean

Here we will illustrate the differences between the linear and quadratic filtering frameworks in the context of the posterior distributions of the Lorenz model. Figure 5 shows the true mean of the posterior as well as the estimates of that mean from the linear and quadratic filtering frameworks. The curvature of the posterior mean is now clearly related to the underlying attracting manifold of the Lorenz model. Because of this curvature the linear filtering framework has difficulty estimating the posterior mean of the unobserved variables for all but the smallest innovations. This issue with the linear filtering framework can be seen in the plots of the error variance about the linear filtering estimate of the mean in Figs. 5d,e. Similar to the chi-square model of the previous section we can see that the effect of including more terms in the expansion is to make the estimate of the mean of the posterior distribution reliable over a wider range of innovations. In addition, the estimation of **f**_{0} by the quadratic filter makes a clear difference in the estimation of the mean of the posterior for the *x* variable in Fig. 5a. This can be seen in the significant difference in the estimation of the mean for small innovation.
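The role of the **f**_{0} correction can be seen in a small synthetic example. The sketch below is an assumed stand-in for Fig. 5a rather than the actual Lorenz posterior: the "unobserved variable" is constructed so that its conditional mean is curved in the innovation. The linear fit then returns the prior mean at zero innovation, while the quadratic fit recovers the true conditional mean there.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Innovation and a synthetic unobserved variable whose conditional mean
# is curved in the innovation (an assumed construction, not the attractor).
v = rng.normal(0.0, 1.0, size=n)
x = 0.5 * v**2 - v + rng.normal(0.0, 0.3, size=n)

lin = np.polyfit(v, x, 1)    # linear filter analogue
quad = np.polyfit(v, x, 2)   # quadratic filter analogue

def fit_err(coeffs):
    return np.var(x - np.polyval(coeffs, v))

# The quadratic fit captures the curvature, shrinking the residual variance...
print(fit_err(lin), fit_err(quad))
# ...and corrects the estimate at zero innovation: the linear fit returns
# the prior mean (~0.5) there, while the quadratic fit recovers ~0.0.
print(np.polyval(lin, 0.0), np.polyval(quad, 0.0))
```

The nonzero shift at **v** = 0 is the regression analogue of the quadratic filter's estimate of **f**_{0} from the third moment of the prior.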

#### 2) Sampling from the posterior

We again use the perturbed observations framework in (4.1) and (4.3) to attempt to sample from the posterior distribution. The result is plotted in Fig. 6. Because the most noteworthy differences can be illustrated with the variable *x* alone, we show only that variable. The estimated posterior distribution for the variable *y* is very similar to that of *x* shown in Fig. 6. In contrast, the observed variable *z* has the best posterior estimate because of its significant linear characteristics (not shown). The dominating characteristic of the unobserved variables is that the structure of the estimated posterior distributions as a function of the innovation is excessively smooth. For both the *x* and *y* variables, the inaccurate posterior mean of the linear filter centers the dispersion of the ensemble members at far too large values of *x* and *y*. For the quadratic filter, the curvature of the mean of the distribution keeps the ensemble members appropriately centered, although, as with the linear filter, the dispersion about the mean is still far too uniform. Note also that because we have insisted that the ensemble have uniform variance as a function of innovation, the posterior third moment in (B.12) is not appropriately related to the first two moments of the distribution.

In Fig. 7 we show the estimate of the posterior error variance from taking the derivative of the linear and quadratic estimation equations with respect to the innovation. Because we have one observation in this experiment the derivative technique of appendix B obtains an estimate of a single column of the posterior error covariance matrix as a function of the innovation. Recall that in the linear estimation technique this column is equal to the expected value of the analysis error covariance for a linear estimator. As can be seen in Fig. 7 the quadratic framework does a fairly good job of estimating the true posterior error covariance over a wide range of innovations.
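In the scalar normal case the relationship underlying the derivative technique of appendix B can be checked directly: for a linear estimate with gain k = p/(p + r), the posterior error variance (1 − k)p equals r times the derivative of the estimate with respect to the innovation. A minimal numeric sketch (all values assumed):

```python
p, r = 2.0, 1.0                 # prior and observation error variances (assumed)
k = p / (p + r)                 # Kalman gain

def xhat(v):
    return k * v                # posterior-mean increment as a function of v

# Finite-difference derivative of the estimate with respect to the innovation.
dv = 1e-6
slope = (xhat(dv) - xhat(-dv)) / (2 * dv)

posterior_var_from_slope = r * slope
posterior_var_exact = (1 - k) * p
print(posterior_var_from_slope, posterior_var_exact)  # both 2/3
```

For a nonlinear estimate such as the quadratic filter's, the same finite-difference slope varies with the innovation, which is what allows the technique to track the innovation-dependent posterior variance in Fig. 7.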

## 5. Summary and conclusions

A framework has been constructed through which the effect of the moments of the prior on the mean of the posterior can be clearly understood. This framework relies on the fact that the mean of the posterior can be written as a smooth function of the innovation and does not require any assumptions on the structure of the prior or the observation likelihood. This function was then expanded in a polynomial series in the innovation to relate the structure of this function to the moments of the prior. The result of this procedure was the determination of a “linear” update equation for the mean of the posterior using powers of the innovation.

The traditional Kalman filtering equations have been shown to be consistent with a dominant term in the expansion. The expansion shows that the linear filter is an accurate approximation to the posterior mean when the prior distribution is symmetric, but not necessarily normal, and the innovation is not particularly large. In this case, the expansion is relatively accurate after truncation to the term linear in the innovation, which is of course the traditional Kalman update equation for the posterior mean.

Upon including additional terms in the expansion a filter quadratic in the innovation is derived. This filter is quadratic in the sense that it uses square products of the innovation as additional predictors to increase the accuracy of the estimate of the posterior mean. The quadratic filter was shown to be able to produce state estimates that better follow the curvature of nonnormal posterior distributions when compared against traditional linear filtering techniques. In addition, this new filter is also capable of making a correction to the prior mean in situations where the innovation is identically zero by making use of information derived from the third moment of the prior distribution.

This polynomial expansion framework was applied to simple models of nonnormal pdfs. It was shown that increasing the number of predictors through powers of the innovation makes the state estimate more reliable over a larger range of innovations. This behavior is most strongly realized in situations where the posterior distribution is highly curved as a function of the innovation. This was shown to be due to the curvature in the posterior distribution being directly related to the moments higher than the second. This behavior suggests that using the additional predictors suggested by the framework of the polynomial expansion should require fewer innovations to be discarded during the assimilation process and hence deliver a better estimate of the posterior mean. On the Lorenz (1963) attractor it was shown that the posterior mean of the unobserved variables had the largest curvature as a function of the innovation, and consequently the largest improvement from the quadratic innovation technique was found in the unobserved variables. This result has recently been seen in preliminary experiments in a high-dimensional nonlinear model and will be reported on in a later study.

The generation of the ensemble was also explored under the polynomial expansion framework. It was shown that the ensemble generation technique referred to as “perturbed observations” produces a collection of states that satisfy the expected posterior error variance. This turns out to be a significant issue for posterior distributions that are highly curved as a function of the innovation because these distributions have posterior error variances that can differ substantially from their expected values. It is important to note that this issue with the error variance is not solely a peculiarity of the perturbed observations algorithm: square root filters (e.g., Anderson 2001; Bishop et al. 2001) also require the variance of the ensemble to be that of the expected error variance rather than the error variance as a function of the innovation. This issue with the error variance of the ensemble’s representation of the true posterior distribution has previously been dealt with implicitly using inflation techniques (e.g., Anderson 2007). We have shown here that an alternate and more direct technique can also be used that makes use of the fact that the true posterior error variance is related to the derivative of the state estimate with respect to the innovation.

Similarly, the estimate of the posterior distribution during ensemble generation through perturbed observations was shown to result in an overly symmetric distribution and hence the wrong third moment. This implies the possibility that cycling over short intervals may lead to a significant amount of symmetry being imposed upon the ensemble's estimate of the distribution by the ensemble generation algorithm and a subsequent poor estimate of the third moment. While longer cycling intervals may allow for the development of asymmetric features in the prior through nonlinear processes, they also increase the error variance of the prior, leading to the possibility of filter divergence.

The development of a localization strategy and the determination of the appropriate ensemble size are likely to be the most significant issues with the application of the quadratic filter. Preliminary results with a high-dimensional nonlinear model suggest that an ensemble size of at least 100 is necessary for properly estimating the third and fourth moment matrices. The exploration of approximations for computational expediency and the examination of the effects of sampling error, localization, and inflation in the framework of the polynomial expansion in a high-dimensional system are presently being examined and will be reported on in a later study.

## Acknowledgments

Useful and encouraging conversations on this subject with Craig Bishop, Jim Hansen, Justin Mclay, Jon Moskaitis, and Chris Snyder are greatly appreciated. I gratefully acknowledge support from the Naval Research Laboratory Base Program through Grant BE-4287-0-0-5.

### APPENDIX A

#### Normally Distributed Powers of the Error

We show here that the linear gain matrix representation in (2.18) is consistent with the solution of the state estimation problem given normally distributed powers of the error.

The basic idea here will be to treat the higher moments as simply an additional dimension to the problem. We begin by making the ansatz that the pdf of the normal distributions that we are interested in may be written as

where *N*_{p} and *N*_{υ} are normalization constants. The extended state vector represents variations of the state with respect to the moments in and is a vector with the expected value repeatedly concatenated: . The first *N* elements of , which are denoted by the vector , are intended to represent the state vector for the system we are interested in updating. The “carets” above a variable represent expansions as in (2.9), is the observation operator in predictor space, the covariance is defined with respect to where , and

An example of the “prior” (A.1) is plotted in Fig. A1 for the truncated two-dimensional space *x*_{1} and *x*_{2} for the chi-square prior distribution of section 4a. Note that this “pdf” is not intended to represent the probability of a particular state within the chi-square distribution but rather its moments.

Upon use of Bayes’ rule to obtain the posterior distribution we find

where *N _{a}* is simply a normalization constant and

To prove the relationship between this framework and that of section 2 we will assume that the observation operator is linear such that the mean and the mode of the “posterior” in (A.5) are identical. Hence, we may find an expression for the posterior mean by minimizing (A.6) with respect to the extended state vector . This procedure obtains the following:

The Sherman–Morrison–Woodbury formula,
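In its standard generic form (the symbols **A**, **U**, **C**, **V** below are placeholders, not the paper's matrices), the identity reads

```latex
(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1}
  = \mathbf{A}^{-1}
  - \mathbf{A}^{-1}\mathbf{U}
    \left(\mathbf{C}^{-1} + \mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}
    \mathbf{V}\mathbf{A}^{-1} ,
```

which, applied here,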

allows (A.7) to be written in a more concise form:

where is the expected error covariance matrix of the posterior. Note that because the distributions in (A.1) and (A.2) are normal, the error covariance matrix of the posterior is not a function of the predictor vector . Therefore, the error covariance matrix of the posterior takes the standard form:

where is the gain matrix extended to include all powers of the forecast error :

Note that the first *N* rows of (A.11) are identical to (2.18). Hence, we have now shown that the mean of a nonnormal posterior is related to an underlying normal distribution determined by powers of the forecast error. In addition, this reveals the proper variational solution for the quadratic filter as simply the minimum of (A.6) where we truncate the expansion at

This formulation of the problem was verified to precisely reproduce the quadratic estimate of the mean of the distributions of section 4a. Note that this formulation of the variational solution allows for the solution of the state estimation problem for nonnormal priors with standard minimization techniques.

### APPENDIX B

#### Geometry of the Mean

We begin by noting that the derivative of the posterior mean with respect to the innovation is

where (B.1) is to be recognized as a matrix whose number of columns is equal to the number of observations and whose number of rows is equal to the dimension of the state vector. (Note that this implies that we have defined the derivative as referring to a row-vector operation.) The conditional density in (B.1) follows from (2.1):

The derivative of (B.2) with respect to the innovation is

where we emphasize that the quantity in (B.3) is a row vector. The normalization factors are

We assume here that the observation likelihood is normally distributed such that we may write the distribution of the innovations conditioned upon the truth as

where *N _{o}* is a normalization factor that is independent of the innovation. Because of the properties of the derivative of an exponential function the derivative of (B.6) takes a particularly simple form:

where

Similarly, for the calculation of the third moment we may concatenate the columns of into a single-column vector as

where the squared vector in (B.11) refers to the operation in (2.6). By taking the derivative of (B.11) with respect to the innovation we obtain

where is a rectangular matrix containing the required third moments of the posterior distribution. Equation (B.12) implies that the curvature of the posterior mean may be written as

where represents the slope of the posterior mean concatenated according to the structure defined by (B.11) and “⊗” is the Kronecker product.

A similar recursive relationship may be obtained for the moments beyond the third by performing the analogous analysis.