## 1. Introduction

### a. The context

This is a review of the non-Gaussian aspects of data assimilation, in the context of geophysics. Through references and a few examples, it investigates the difficulties in producing analyses using statistical modeling that goes beyond Gaussian hypotheses or beyond second-order moment closure. The emphasis is on the concepts and promising ideas, rather than on the technicalities or the completeness of the bibliography. However, mathematical details will be given when necessary or appealing. Examples, original in some cases, will be provided using simple models.

Nonlinearity and non-Gaussianity are interlaced topics, and it is difficult to discuss only one facet of the problem, neglecting the other. For instance, nonlinearities of a dynamical model inevitably produce non-Gaussian priors to be used in later analyses. Nevertheless, the focus of this review is more on the non-Gaussian aspects from the statistical modeling viewpoint, and on ways to extend the usual Gaussian analysis of current data assimilation orthodoxy. There are reviews, or relevant reports, that focus more on the nonlinear aspects but less on modeling the non-Gaussian statistics (Miller et al. 1994; Evensen 1997; Verlaan and Heemink 2001; Andersson et al. 2005). The intended scope of the article is broad: meteorology, oceanography, and atmospheric chemistry. Non-Gaussianity may take many forms there, and does not necessarily always come from the dynamics. However, a commonality of these fields is the very large dimension of the state and observation vector spaces. At first glance, this rules out most of the sophisticated probabilistic mathematical methods that are meant to estimate the full state probability density function (pdf), or higher-order moments, or to provide a state estimate without any approximation.

### b. Statistical modeling of the estimation problem

A data assimilation system consists of a set of observations, and of a numerical model, that may be static or dynamical, that may be deterministic or stochastic, and that represents the underlying physics. The mathematical modeling of uncertainty for this system implies that one embeds models and observations in a statistical framework and provides uncertainty input about them. From the uncertainty of the components of this data assimilation system, one can ultimately infer, not only an estimate of the true state, but also the uncertainty of that estimate.

This uncertainty could originate from the imprecise initial state of the system. It could also stem from the more or less precise identification of forcings of the dynamical systems, such as emission fields (in atmospheric chemistry), radiative forcing, boundary conditions, and couplings to other models that may be imperfect. The deficiency of the model itself is another source of uncertainty. To account for this type of uncertainty, models could explicitly be made probabilistic. This occurs when some stochastic forcing is implemented to represent subgrid-scale processes in Eulerian models, or when stochastic particles are simulated to represent dispersion in Lagrangian models. The uncertainty could also come from the observations in the form of representativeness or instrumental errors, or indirectly from the models and algorithms used to filter these observations through quality control. Finally, in the case of remote sensing, it could stem from the joint use of a model (a radiative transfer model for instance) and an algorithm that infers data from indirect measurements.

The proper statistical modeling depends on how uncertainty evolves under the full data assimilation system dynamics. In particular, in the context of forecasting, this modeling should properly account for the uncertainty growth–reduction cycle, which is controlled by the forecast–analysis steps of the data assimilation cycle.

Truncating statistics to the first- and second-order moments (bias and error covariance matrix) may be necessary because of the complexity of fully Bayesian data assimilation algorithms. It also avoids storing higher-order moments, which are gigantic objects in geophysical systems. This truncation may also be justified from the point of view of the evolution of the dynamical model. If, in the vicinity of a trajectory, the model can be replaced by its tangent linear approximation, then initially Gaussian statistics will remain Gaussian in this vicinity. Unfortunately, the statistics diverge from ideal Gaussianity when the model is strongly nonlinear, when the analyses are infrequent, or when the observational data are sparse (Pires et al. 1996; Evensen 1997).
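The statement that Gaussian statistics survive a linear model but not a nonlinear one can be checked on a toy example. The sketch below is only illustrative: the affine and quadratic maps, the sample size, and the helper names `skewness` and `push_forward` are assumptions, not taken from the studies cited above. It pushes a Gaussian sample through both maps and compares the sample skewness, which vanishes for a Gaussian.

```python
import random
import statistics


def skewness(samples):
    """Sample skewness: third standardized moment (zero for a Gaussian)."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return sum(((x - mean) / std) ** 3 for x in samples) / len(samples)


def push_forward(samples, model):
    """Propagate each sample through the model, as an ensemble forecast would."""
    return [model(x) for x in samples]


rng = random.Random(0)
prior = [rng.gauss(1.0, 0.5) for _ in range(20000)]  # Gaussian initial statistics

# Hypothetical models: an affine map stands in for a tangent linear model,
# a quadratic map for a strongly nonlinear one.
linear_forecast = push_forward(prior, lambda x: 2.0 * x + 1.0)
nonlinear_forecast = push_forward(prior, lambda x: x * x)

print("skewness, linear model:   ", round(skewness(linear_forecast), 3))
print("skewness, nonlinear model:", round(skewness(nonlinear_forecast), 3))
```

The affine map leaves the skewness near zero, whereas the quadratic map produces a clearly skewed forecast sample, which is the kind of departure from Gaussianity that a second-order truncation cannot represent.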

### c. Outline of the review of ideas

In this article, these arguments will be developed. At first, the reasons why researchers in geophysical data assimilation have avoided exact Bayesian modeling, even though it may be more natural, will be reviewed (section 2).

Because a fully Bayesian approach may not be computationally affordable for geophysical or large environmental problems, Gaussian filtering and variational approaches have been developed, with success, in geophysical data assimilation. The reasons why non-Gaussianity is nevertheless often bound to reemerge in the statistical modeling of well-behaved geophysical problems are then detailed. Next, we briefly describe the strategies that have been employed, especially in four-dimensional variational data assimilation (4D-Var) and the ensemble Kalman filter, to accommodate possible non-Gaussian deviations in the data assimilation system (section 3). However, this is not the main focus of this review, since the literature offers excellent discussion papers on these topics (Kalnay et al. 2007). Ways to objectively measure the deviations from Gaussianity will then be discussed (section 4).

In section 5, the following strategies to better control the uncertainty and to make it less non-Gaussian are examined: targeting of observations, specific treatment of highly nonlinear degrees of freedom, localization, and model reduction.

In section 6, we will review a selection of new ideas to make use of non-Gaussian statistical modeling in data assimilation, either perturbatively (about a Gaussian formalism), or nonperturbatively (without direct reference to a Gaussian formalism). Several original examples will serve as illustrations.

To conclude, the future of these ideas and concepts is discussed (section 7).

## 2. Why not non-Gaussian from the start?

A Gaussian modeling of the uncertainty in geophysical data assimilation problems may not be the most natural approach to begin with. A more natural approach would be to forget about the constraints such as the dimensionality of the state space of geophysical models. In a constraint-free context, rigorous methods have been proposed in applied mathematics to solve analytically or numerically the fully nonlinear estimation problem, either in discrete or continuous time.

### a. The estimation problem

In Eq. (1), $\mathbf{x}_k$ is the state vector in $\mathbb{R}^N$ at time $t_k$, and $\mathbf{w}_k$ and $\boldsymbol{\upsilon}_k$ are noises that represent model and observation errors, respectively. They are stochastic in nature and stand for the uncertainty inherent to the system in Eq. (1). In this section, it is assumed for the sake of simplicity that these perturbations are additive, that is, one has instead

$$
\mathbf{x}_k = M_k(\mathbf{x}_{k-1}) + \mathbf{w}_k, \qquad
\mathbf{y}_k = H_k(\mathbf{x}_k) + \boldsymbol{\upsilon}_k, \tag{2}
$$

where $M_k(\mathbf{x}_{k-1})$ is the model that links deterministically $\mathbf{x}_{k-1}$ to $\mathbf{x}_k$, $\mathbf{y}_k$ is the vector of observations in $\mathbb{R}^d$ at time $t_k$, and $H_k: \mathbb{R}^N \to \mathbb{R}^d$ is the observation operator. Note that the number of observations $d$ may depend on time index $k$ in any realistic context. Here $\mathbf{w}_k$ and $\boldsymbol{\upsilon}_k$ are additive noises that represent model and observation errors. The sets of random vectors $\{\mathbf{w}_k\}_{k=1,\ldots,K}$ and $\{\boldsymbol{\upsilon}_l\}_{l=1,\ldots,K}$ are taken to be white in both space and time, and they are assumed to be mutually independent.

Let $p_W$ be the pdf of $\mathbf{w}_k$, and let $\boldsymbol{\upsilon}_k$ be distributed according to the pdf $p_V$. Both $p_W$ and $p_V$ may well depend on time $t_k$, but the dependence is not made explicit in the notation. These laws define the transition kernel, which probabilistically relates state $\mathbf{x}_{k-1}$ to state $\mathbf{x}_k$, as well as the likelihood of the state $\mathbf{x}_k$ with respect to the observations $\mathbf{y}_k$:

$$
p(\mathbf{x}_k|\mathbf{x}_{k-1}) = p_W[\mathbf{x}_k - M_k(\mathbf{x}_{k-1})], \qquad
p(\mathbf{y}_k|\mathbf{x}_k) = p_V[\mathbf{y}_k - H_k(\mathbf{x}_k)]. \tag{3}
$$

Within this probabilistic framework, one could either be interested in estimating the true state of the system, along with its uncertainty, at the present time, or be interested in estimating the true state for all times.
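The transition kernel and the likelihood that follow from additive noise can be written down directly for a scalar toy system. The sketch below is illustrative only: the Gaussian choice for both noise pdfs and the functions `model` and `obs_operator` are hypothetical stand-ins for $M_k$ and $H_k$, not taken from the text.

```python
import math


def gaussian_pdf(x, std):
    """Zero-mean Gaussian density, used here for both noise pdfs (an assumption)."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def model(x):
    """Hypothetical deterministic model M_k."""
    return 0.9 * x + 1.0


def obs_operator(x):
    """Hypothetical nonlinear observation operator H_k."""
    return x * x


def transition_density(x_k, x_prev, std_w=0.5):
    """p(x_k | x_{k-1}) = p_W[x_k - M_k(x_{k-1})]."""
    return gaussian_pdf(x_k - model(x_prev), std_w)


def likelihood(y_k, x_k, std_v=1.0):
    """p(y_k | x_k) = p_V[y_k - H_k(x_k)]."""
    return gaussian_pdf(y_k - obs_operator(x_k), std_v)


print(transition_density(1.2, 0.3), likelihood(4.0, 2.0))
```

Any other noise pdf can be substituted for `gaussian_pdf` without changing the structure, which is precisely what makes this formulation suitable for non-Gaussian error models.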

With $\mathsf{X}_k = \{\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k\}$, the collection of all state vectors from time $t_0$ to time $t_k$, and $\mathsf{Y}_k = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k\}$, the collection of all observation vectors from time $t_1$ to time $t_k$, recursive application of Bayes’s law and transition rules leads to the conditional pdf (Jazwinski 1970; Lorenc 1986):

$$
p(\mathsf{X}_k|\mathsf{Y}_k) \propto p(\mathbf{x}_0) \prod_{l=1}^{k} p(\mathbf{y}_l|\mathbf{x}_l)\, p(\mathbf{x}_l|\mathbf{x}_{l-1}), \tag{4}
$$

where $p(\mathbf{x}_0)$ is the prior pdf of the initial state vector. The quantity $\ln[p(\mathsf{X}_k|\mathsf{Y}_k)]$ defines an objective function that ranks state trajectories according to their likelihood. This is the usual Bayesian embedding of the variational formalism and in particular of 4D-Var when the statistics are assumed to be Gaussian.

_{k}**y**

*. It makes use of Bayes’s theorem:*

_{k}### b. Nonlinear statistical time-continuous estimation

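In continuous time, the state is assumed to obey the stochastic differential equation

$$
d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)\,dt + \mathsf{G}(\mathbf{x}_t, t)\,d\mathbf{w}_t, \tag{7}
$$

where $\mathbf{f}$ is the deterministic part of the model. The noise $d\mathbf{w}_t$ drives the uncertainty ($\mathbf{w}_t$ is a standard Wiener process), and is weighted by the deterministic matrix function $\mathsf{G}(\mathbf{x}_t, t)$. It is well known that the pdf of the full system state vector $p_t$ obeys a Fokker–Planck equation [see Gardiner (2004) for an exposition from the physicist’s point of view]:

$$
\frac{\partial p_t}{\partial t} = \mathcal{L}(p_t), \tag{8}
$$

where

$$
\mathcal{L}(p_t) = -\sum_{i} \frac{\partial (f_i\, p_t)}{\partial x_i} + \frac{1}{2} \sum_{i,j} \frac{\partial^2 \left([\mathsf{Q}_t]_{ij}\, p_t\right)}{\partial x_i\, \partial x_j}, \tag{9}
$$

with $\mathsf{Q}_t = \mathsf{G}(\mathbf{x}_t, t)\,\mathsf{G}(\mathbf{x}_t, t)^{\mathrm{T}}$. The observation equation also has its stochastic formulation:

$$
d\mathbf{y}_t = \mathbf{h}(\mathbf{x}_t, t)\,dt + \mathsf{R}_t^{1/2}\, d\boldsymbol{\upsilon}_t, \tag{10}
$$

where $\mathbf{h}(\mathbf{x}_t, t)$ is the deterministic observation operator, $\mathsf{R}_t$ is the observation error covariance matrix, and $\boldsymbol{\upsilon}_t$ is a standard Wiener process independent from $\mathbf{w}_t$. The evolution of the conditional pdf $p_t^{\star}$, derived from Eqs. (7) and (10), is governed by the Zakai equation (Zakai 1969) (or alternatively the normalized Kushner equation):

$$
dp_t^{\star} = \mathcal{L}(p_t^{\star})\,dt + p_t^{\star}\, \mathbf{h}(\mathbf{x}_t, t)^{\mathrm{T}}\, \mathsf{R}_t^{-1}\, d\mathbf{y}_t. \tag{11}
$$

Throughout the paper, T designates the matrix and vector transpose operator.

From an analytical point of view, regardless of the algorithmic complexity and numerical cost, the full statistical estimation (smoothing or filtering) problem can be solved by exact methods. These ideas have been explored by Miller et al. (1999) from a geophysicist’s perspective. Several examples, from a one-dimensional double-well potential model to a truncated spectral barotropic model, were given as illustrations of strongly nonlinear systems.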

From an algorithmic standpoint, by making the knowledge of a dynamical system a probabilistic one, the objects to deal with are no longer in $\mathbb{R}^N$ (estimate of a state vector). Rather, they are functions $p(\mathbf{x})$ of $N$ variables, and possibly time in the smoothing case. This stresses the change of scale in complexity. The mathematics required to handle these problems does exist, but its efficiency is questionable when applied to high-dimensional geophysical systems.
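In one dimension, the pdf governed by a Fokker–Planck equation can be approximated by integrating many independent realizations of the underlying stochastic differential equation and histogramming the result. The sketch below is a minimal illustration under stated assumptions: an Euler–Maruyama scheme, a hypothetical double-well drift $x - x^3$, a constant noise amplitude `sigma`, and arbitrary step and sample sizes. It is not the configuration used by Miller et al. (1999).

```python
import math
import random


def drift(x):
    """Hypothetical double-well drift f(x) = x - x^3 (stable states near -1 and +1)."""
    return x - x ** 3


def euler_maruyama(x0, sigma, dt, n_steps, rng):
    """Integrate dx = f(x) dt + sigma dw with the Euler-Maruyama scheme."""
    x = x0
    for _ in range(n_steps):
        x += drift(x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


def sample_pdf(n_samples=1000, sigma=0.5, dt=0.01, n_steps=500, seed=1):
    """Draw samples of x at time n_steps * dt from a narrow Gaussian initial pdf."""
    rng = random.Random(seed)
    return [
        euler_maruyama(rng.gauss(0.0, 0.1), sigma, dt, n_steps, rng)
        for _ in range(n_samples)
    ]


samples = sample_pdf()
fraction_left = sum(x < 0.0 for x in samples) / len(samples)
mean_abs = sum(abs(x) for x in samples) / len(samples)
print("fraction in left well:", fraction_left, " mean |x|:", round(mean_abs, 2))
```

Starting from a narrow Gaussian centered on the unstable point, the samples split between the two wells, so the resulting pdf is bimodal. The cost of this brute-force approach grows quickly with the state dimension, which is the change of scale in complexity mentioned above.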

### c. Particle filters and their curse in high-dimensional systems

When it comes to numerically solving the fully Bayesian filtering (possibly smoothing) problem, the most popular approaches are based on Monte Carlo sampling (Handschin and Mayne 1969; Kitagawa 1987). Sequential Monte Carlo methods to solve these nonlinear filtering equations are called particle filters. A quite complete and very clear review of the potential applications of particle filters to the geophysical estimation problems was offered by van Leeuwen (2009).

The particle filter is based on a set of particles $\{\mathbf{x}_k^1, \mathbf{x}_k^2, \ldots, \mathbf{x}_k^M\}$, that is, a collection of system vector states $\mathbf{x}_k^i \in \mathbb{R}^N$ at time $t_k$, which samples the probability density function $p_k(\mathbf{x}_k)$:

$$
p_k(\mathbf{x}_k) \simeq \sum_{i=1}^{M} \omega_k^i\, \delta(\mathbf{x}_k - \mathbf{x}_k^i). \tag{12}
$$

The weights $\omega_k^i$ are initially all equal to $1/M$, but could differ in the following.

When an observation vector $\mathbf{y}_k$ becomes available, the analysis consists in applying Bayes’s formula to the Monte Carlo estimate of the pdf. The weights are then simply altered by the following likelihood:

$$
\omega_k^i \longrightarrow \omega_k^i\, p(\mathbf{y}_k|\mathbf{x}_k^i), \tag{13}
$$

and they need to be normalized so that their sum is 1. This analysis step just involves multiplications of weights and likelihoods. In particular, the innovation-statistics matrix is not inverted as would be required by most filtering methods based on first- and second-order moments. This makes the particle filter a simple and beautiful method, but we shall soon recall its curse.

During the forecast step from time $t_k$ to time $t_{k+1}$, the particles are propagated with the model $\mathbf{x}_{k+1} = M_{k+1}(\mathbf{x}_k) + \mathbf{w}_{k+1}$. In the context of the bootstrap filter, the pdf is then simply updated at time $t_{k+1}$ and satisfies Eq. (12) but at time $t_{k+1}$. This completes the particle filter cycle.
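One forecast and analysis cycle of a bootstrap particle filter can be sketched in a few lines for a scalar state. The example below is a minimal illustration, not the implementation used later in this section: the model `model`, the Gaussian noise amplitudes, the identity observation operator, and all numerical values are assumptions.

```python
import math
import random


def model(x):
    """Hypothetical scalar model M; any deterministic map could be used."""
    return 0.9 * x + 2.0 * math.sin(x)


def forecast(particles, std_w, rng):
    """Propagate each particle with the model plus additive model noise."""
    return [model(x) + rng.gauss(0.0, std_w) for x in particles]


def analysis(particles, weights, y, std_v):
    """Multiply each weight by the Gaussian likelihood p(y | x^i), then normalize.

    The observation operator is taken to be the identity (an assumption).
    """
    log_lik = [-0.5 * ((y - x) / std_v) ** 2 for x in particles]
    shift = max(log_lik)  # avoids underflow; cancels out in the normalization
    unnormalized = [w * math.exp(l - shift) for w, l in zip(weights, log_lik)]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


rng = random.Random(42)
n_particles = 500
particles = [rng.gauss(0.0, 2.0) for _ in range(n_particles)]
weights = [1.0 / n_particles] * n_particles  # uniform initial weights

y = 1.5  # synthetic observation
particles = forecast(particles, std_w=0.5, rng=rng)
weights = analysis(particles, weights, y, std_v=1.0)
posterior_mean = sum(w * x for w, x in zip(weights, particles))
print("posterior mean estimate:", round(posterior_mean, 3))
```

Note that the analysis only rescales the weights: the particles themselves are not moved toward the observation, which is what makes the method simple and also what exposes it to weight degeneracy.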

Unfortunately, for high-dimensional systems, most of the weights vanish and just a few particles remain likely (see e.g., Berliner and Wikle 2007). Therefore, in this case, the particle filter becomes useless for the estimation problem.

The ensemble size required for a proper estimation has been shown to scale exponentially with the system size, or the innovation variance. To prove this, Snyder et al. (2008) studied the statistics of the largest weight. They demonstrated on a simple Gaussian model that the required size of the ensemble scales like $M \sim \exp(\tau^2/2)$, where $\tau^2$ is the variance of the observation log-likelihood. Under simple independence assumptions, $\tau^2$ is expected to scale like the dimension of observation space on the one hand, and the dimension of state space on the other.
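The growth of the largest weight with dimension can be reproduced on a toy Gaussian problem. The sketch below is an illustration in the spirit of the argument above, not a reproduction of the experiments of Snyder et al. (2008): the standard normal prior, the fully observed state with unit observation error, the ensemble size, and the function names are all assumptions.

```python
import math
import random


def max_weight(n_dim, n_particles, rng):
    """Largest normalized weight for one random draw of truth, observation, particles."""
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
    obs = [t + rng.gauss(0.0, 1.0) for t in truth]  # every component is observed
    log_lik = []
    for _ in range(n_particles):
        particle = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]  # drawn from the prior
        log_lik.append(-0.5 * sum((y - x) ** 2 for y, x in zip(obs, particle)))
    shift = max(log_lik)
    weights = [math.exp(l - shift) for l in log_lik]
    return max(weights) / sum(weights)


def mean_max_weight(n_dim, n_particles=200, n_trials=20, seed=0):
    """Average of the largest weight over independent trials."""
    rng = random.Random(seed)
    return sum(max_weight(n_dim, n_particles, rng) for _ in range(n_trials)) / n_trials


results = {n_dim: mean_max_weight(n_dim) for n_dim in (2, 10, 100)}
for n_dim, value in results.items():
    print(f"N = {n_dim:3d}: mean largest weight = {value:.3f}")
```

With a fixed ensemble size, the average largest weight is small in low dimension and approaches 1 in high dimension, meaning that a single particle carries almost all of the probability mass.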

This behavior is related to the so-called curse of dimensionality (Bellman 1961). Consider one of the particles, a state space vector in $\mathbb{R}^N$. It is meant to be representative of a volume $[-\varepsilon, \varepsilon]^N$ of state space centered on that particle. However, particles similar to that particle are close to it according to the Euclidean metric: they lie in a neighborhood, say within a distance of $\varepsilon$. In high-dimensional systems, a representativeness issue arises because of the shrinking of the hypersphere of radius $\varepsilon$ within the hypercube $[-\varepsilon, \varepsilon]^N$. Indeed the volume of the hypersphere relative to that of the hypercube vanishes like $(\pi/4)^{N/2}/\Gamma[(N/2) + 1]$ with the space dimension $N$. The particle is less and less representative of the cubic volume it is meant to sample. In the context of data assimilation, this implies that the observational prior and the background prior overlap less and less as the state space or observation space dimensions increase. As a consequence, most of the weights of the particles vanish, leading to a poor analysis.
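The volume ratio quoted above is easy to evaluate numerically. The short sketch below, with the hypothetical helper name `ball_to_cube_ratio`, computes the ratio of the volume of the hypersphere of radius $\varepsilon$ to that of the hypercube $[-\varepsilon, \varepsilon]^N$. The ratio does not depend on $\varepsilon$.

```python
import math


def ball_to_cube_ratio(n_dim):
    """Volume of the N-ball of radius eps divided by the volume of [-eps, eps]^N."""
    return (math.pi / 4.0) ** (n_dim / 2.0) / math.gamma(n_dim / 2.0 + 1.0)


for n_dim in (1, 2, 3, 10, 20, 40):
    print(f"N = {n_dim:2d}: ratio = {ball_to_cube_ratio(n_dim):.3e}")
```

The ratio equals 1 for $N = 1$ and $\pi/4$ for $N = 2$, and it is already below one percent for $N = 10$, which gives a sense of how quickly a particle stops being representative of its surrounding cube.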

To mitigate the collapse of the weights, a resampling step is often used in the bootstrap filter. Basically the idea is to draw new particles among the old ones according to the probability given by their weight. After this resampling, all the new particles have the same weight. However, it is likely that many of the new particles will be drawn from the same original particle. This will deplete the ensemble. Unless model error is already specified and of stochastic nature, it is necessary to introduce some perturbation (noise) into the forecast step, in order to enrich the ensemble.
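The resampling step described above can be sketched as follows. This is a minimal multinomial resampling with an optional Gaussian jitter, given as an illustration only: the function name `resample`, the jitter amplitude, and the scalar particles are assumptions, and other schemes (such as residual resampling) are often preferred in practice.

```python
import random


def resample(particles, weights, rng, jitter_std=0.0):
    """Draw new particles among the old ones with probability given by their weights.

    All new particles receive the same weight. An optional Gaussian jitter
    (an assumption, standing in for a stochastic model error) enriches the ensemble.
    """
    n_particles = len(particles)
    drawn = rng.choices(particles, weights=weights, k=n_particles)
    if jitter_std > 0.0:
        drawn = [x + rng.gauss(0.0, jitter_std) for x in drawn]
    return drawn, [1.0 / n_particles] * n_particles


rng = random.Random(0)
old_particles = [0.1, 0.7, 1.3, 2.0]
old_weights = [0.05, 0.80, 0.10, 0.05]
new_particles, new_weights = resample(old_particles, old_weights, rng)
print(new_particles, new_weights)
```

Without jitter, the heavily weighted particle is typically duplicated several times, which is the depletion of the ensemble mentioned above.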

However, the resampling does not fundamentally solve the issue since the weights still degenerate, possibly at a lower rate. The collapse of the weights could be avoided using Markov chain Monte Carlo (MCMC) techniques, based for instance on a Metropolis–Hastings selection algorithm (Gilks and Berzuini 2001). Contrary to importance sampling (and importance proposal sampling, which will be detailed later), these approaches have no exponential dependence on the dimensionality [see the illuminating discussion by MacKay (2003), chapter 29]. However, a considerable number of iterations would be required for an MCMC approach to sample a filtering pdf like those met in geophysical applications. That is why it is not clear to the authors whether these techniques will be decisive in a successful particle filter for high-dimensional systems.

### d. Illustrations

As an illustration of these concepts, a bootstrap filter is compared to the ensemble Kalman filter (EnKF) of Evensen (1994), on a Lorenz-95 model (Lorenz and Emanuel 1998). The comparison follows the methodology of Nakano et al. (2007). This EnKF is the original single-ensemble variant, corrected by Burgers et al. (1998). The Lorenz-95 model is implemented with the standard forcing parameter *F* = 8, and *N* = 10 variables, in a perfect model setting. In particular, no stochastic forcing is applied. The absence of a stochastic term does not prevent a reliable assessment of the data assimilation system on long enough runs, thanks to the ergodicity of the dynamical system.
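For readers who wish to experiment, the Lorenz-95 model with *F* = 8 and *N* = 10 can be sketched as below. This is a minimal illustration under stated assumptions: the model equations are written in their usual form, $dx_i/dt = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F$ with periodic indices, and the fourth-order Runge–Kutta scheme and the step `dt = 0.05` are choices made for this example rather than details given in the text.

```python
def tendency(x, forcing=8.0):
    """Lorenz-95 tendencies with periodic boundary conditions."""
    n = len(x)
    return [
        (x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
        for i in range(n)
    ]


def rk4_step(x, dt=0.05, forcing=8.0):
    """Advance the state by one fourth-order Runge-Kutta step."""

    def shifted(state, slope, factor):
        return [s + factor * k for s, k in zip(state, slope)]

    k1 = tendency(x, forcing)
    k2 = tendency(shifted(x, k1, 0.5 * dt), forcing)
    k3 = tendency(shifted(x, k2, 0.5 * dt), forcing)
    k4 = tendency(shifted(x, k3, dt), forcing)
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


# Perturb the unstable fixed point x_i = F and let the trajectory develop.
state = [8.0] * 10
state[0] += 0.01
for _ in range(200):
    state = rk4_step(state)
print([round(v, 2) for v in state])
```

The uniform state $x_i = F$ is a fixed point of the model, so a small perturbation is needed to start a nontrivial trajectory.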

One of every two sites is observed. The time interval between two observations is Δ*t* = 0.05, which corresponds to a *geophysical time* of 6 h (Lorenz and Emanuel 1998). The synthetic measurements are perturbed with a normal white noise of root-mean-square *σ* = 1.5. The root-mean-square observation error, *χ*, for the EnKF is also chosen to be 1.5. This places the filter in optimal conditions since the observation error prior statistics coincide with the true observation error statistics. Moreover, in order to reduce the undersampling errors in covariances for small ensemble size, two versions of the EnKF are implemented, with or without localization (Houtekamer and Mitchell 1998). Localization is carried out by means of a tapering function (Houtekamer and Mitchell 2001; Hamill et al. 2001) of the form given by Eq. (4.10) of Gaspari and Cohn (1999), with an optimally tuned localization length. For the localization length as well as other parameters in the following, optimal values were selected via a sensitivity analysis to minimize the analysis error.
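The tapering function mentioned above is commonly transcribed as the following compactly supported, fifth-order piecewise rational function. The sketch below is an illustration only: the parameter name `length` (the half-support $c$ of the function), the ring distance used for the periodic Lorenz-95 sites, and the helper `taper_matrix` are assumptions made for this example.

```python
def gaspari_cohn(distance, length):
    """Fifth-order piecewise rational taper; vanishes beyond 2 * length."""
    r = abs(distance) / length
    if r <= 1.0:
        return -0.25 * r**5 + 0.5 * r**4 + 0.625 * r**3 - (5.0 / 3.0) * r**2 + 1.0
    if r < 2.0:
        return (
            r**5 / 12.0 - 0.5 * r**4 + 0.625 * r**3 + (5.0 / 3.0) * r**2
            - 5.0 * r + 4.0 - 2.0 / (3.0 * r)
        )
    return 0.0


def taper_matrix(n_sites, length):
    """Taper coefficients between all pairs of sites on a periodic ring."""
    matrix = []
    for i in range(n_sites):
        row = []
        for j in range(n_sites):
            separation = min(abs(i - j), n_sites - abs(i - j))
            row.append(gaspari_cohn(separation, length))
        matrix.append(row)
    return matrix


rho = taper_matrix(10, length=2.0)
print([round(value, 3) for value in rho[0]])
```

Localization then consists in multiplying the ensemble covariance matrix element by element with `rho`, which damps the spurious long-distance correlations while leaving the variances unchanged.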

The particle filter to which the EnKF will be compared is the bootstrap filter that was described above. It is tested on the same setup and it uses the same parameters when applicable, as the ensemble Kalman filter. Contrary to the EnKF, no localization is used in the particle filter (this issue will be addressed later). For the particle filter, the prior observation error standard deviation *χ* is allowed to differ from the observation perturbation standard deviation *σ* = 1.5, since it is optimally tuned. For most of the ensemble size *M* range, the optimal values are close to *χ* = 2.5. This implicit error inflation is meant to account indirectly for the sampling error in the representation of the pdf [Eq. (12)]. However, for increasing values of *M* the optimal *χ* tends to *σ* (not shown).

In addition, Gaussian white (in space and time) perturbations are added to all particle state vectors with an amplitude that is optimally tuned for the sake of a fair comparison with the EnKF. Again, the optimal values are obtained thanks to a sensitivity study carried out with varying noise magnitude. Such a noise is also necessary in the EnKF case, at least for small ensemble size, since no inflation was implemented and because a single-ensemble configuration of the EnKF is used (Mitchell and Houtekamer 2009). A (residual) resampling is carried out after each analysis. The two schemes are compared on the analysis root-mean-square error (analysis rms error).

Results are shown in Fig. 1. Despite this tuning, the particle filter still requires $10^4$ members to match the EnKF performance. The size of the system *N* = 10 was chosen so that the EnKF/bootstrap filter cross-over could be observed with a reasonable computation load.

The collapse of the particle filter with increasing state space dimension can be illustrated on the same Lorenz-95 model. Four configurations are chosen identical to the one described above, but for four system sizes *N* = 10, 20, 40, and 80. Figure 2 displays the empirical statistics of the maximum weight in the four cases. In the first two cases, the density is rather balanced, with high values of the weight maxima that are not overrepresented. In the last two cases, the weights degenerate. When *N* = 80, the mode is near 1 and the particle filter collapses (it is of no use for estimation).

Therefore, at least with basic (though not naive) algorithms, it is still unreasonable to use particle filters for the state estimation, let alone high-order moments of the errors.

### e. Gaussian as a makeshift?

Admittedly, a fully non-Gaussian solution to the estimation problem may still be an intractable problem for large geophysical systems. The first nontrivial approximation to the statistical estimation problem is to truncate the error statistics to second order or to assume these statistics are approximately Gaussian. In this case, the multivariate pdf $p(\mathbf{x})$, with $\mathbf{x} \in \mathbb{R}^N$, can be solely derived from the Gaussian correlations between pairs of variables, which are functions of two variables $p(x_k, x_l)$, with $1 \le k, l \le N$. These are equivalent to the full error covariance matrix. Hence, Gaussian estimation still leads to complex objects to deal with, before any reduction. It is computationally tractable only if the full covariance matrix is sampled or reduced.

Gaussian statistics have appealing properties. They remain analytically tractable in multivariate form (i.e., under convolution and integration). Their occurrence and recurrence in physical systems is supported by the central limit theorem. Moreover, Gaussians are the simplest distributions in the sense of information theory (they maximize entropy) when only first- and second-order moments are known, as recalled by Rodgers (2000).

## 3. Dealing with non-Gaussianity in a Gaussian framework

This section explains the current way of dealing with nonlinearity and non-Gaussianity: reasonable data assimilation should consider non-Gaussian effects as corrections to a Gaussian analysis-based strategy. Variational approaches (4D-Var essentially) and ensemble-based Kalman filters include different approaches to account for model nonlinearities.

Sources of non-Gaussianity can be initially categorized into two families: nonlinearities in models and non-Gaussianity of priors. The latter will be emphasized here since it is the main focus of this review, but also because several recently developed methodologies are available. One has to keep in mind that this classification is arbitrary since nonlinear dynamical models inevitably produce non-Gaussian error statistics, which are often used as background error statistics.

### a. Non-Gaussianity from nonlinearities in models

Non-Gaussianity results from nonlinearities in models because, under a nonlinear model transition, a Gaussian pdf becomes non-Gaussian. Nonlinearities in models may come from nonlinearity of Navier–Stokes equations leading to chaos, thresholds of microphysics (cloud, ice, rain), chemistry of atmospheric compounds (including thermodynamics and aerosols size evolution equations), increases in the model resolution that require finer physical schemes such as for precipitation at convective scale, and nonlinearity of observation operator model (remote sensing applications especially: lidar and satellite), etc. Many of these sources of nonlinearities have been discussed by Andersson et al. (2005).

### b. Non-Gaussianity from priors

#### 1) Prior modeling of state space or control space variables

Non-Gaussian priors are sometimes more adequate descriptions of the background. This is especially the case for positive variables, with large deviations about their mean. This occurs in many geophysical fields. Humidity in meteorology, species concentrations and emission inventories in atmospheric chemistry, algae population in ocean biogeochemistry, and ice and gas age in paleoglaciology ice cores are just a few examples. In the case of atmospheric chemistry, it is usually considered that the typical errors in emission inventories are of the order of 40%, before any assimilation, which rules out Gaussian modeling for positive variables. Lognormal distributions are usually used instead.
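One way to quantify the mismatch between a Gaussian prior and a positive variable is to compute the probability mass that the Gaussian assigns to negative values, and to compare with a lognormal distribution matched to the same mean and relative error. The sketch below is illustrative: the function names and the moment-matching convention are assumptions, and the 40% figure is simply the order of magnitude quoted above.

```python
import math


def gaussian_negative_probability(relative_error):
    """P(x < 0) for a Gaussian whose standard deviation is relative_error times its mean."""
    return 0.5 * math.erfc(1.0 / (relative_error * math.sqrt(2.0)))


def lognormal_parameters(mean, relative_error):
    """Mean and standard deviation of ln(x) matching the given mean and relative error."""
    var_ln = math.log(1.0 + relative_error ** 2)
    mu_ln = math.log(mean) - 0.5 * var_ln
    return mu_ln, math.sqrt(var_ln)


for relative_error in (0.4, 1.0):
    p_negative = gaussian_negative_probability(relative_error)
    mu_ln, sigma_ln = lognormal_parameters(1.0, relative_error)
    print(
        f"relative error {relative_error:.0%}: Gaussian P(x<0) = {p_negative:.4f}, "
        f"lognormal ln-mean = {mu_ln:.3f}, ln-std = {sigma_ln:.3f}"
    )
```

The lognormal distribution assigns no mass to negative values by construction, and its skewed shape allows the large positive deviations about the mean that such fields exhibit.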

Pires et al. (2010) take the example of brightness temperature from the High Resolution Infrared Sounder (HIRS) channels to show that non-Gaussianity may also stem from the variability in the specified standard deviations of background errors (a statistical property called heteroscedasticity), in particular when aggregate statistics are used in data assimilation systems.

#### 2) Observation error prior

$l_1$, outside this Gaussian domain. Also, some observations may not pass the QC filter although they are strong indicators of extreme events, as was the case for the Lothar storm of December 1999 (more information available online at http://4dvarenkf.cima.fcen.uba.ar/Download/Session_3/4DVar_nLnG_Fisher.ppt; C. Tavolato and L. Isaksen 2009, personal communication). A more tolerant filter is affordable if the observation errors are not necessarily Gaussian. This is the so-called Huber norm (Huber 1973), which is differentiable. In its standard form, for a normalized departure $\delta$ and a transition threshold $c > 0$, it reads

$$
\rho(\delta) =
\begin{cases}
\dfrac{\delta^2}{2}, & |\delta| \le c, \\[1ex]
c\,|\delta| - \dfrac{c^2}{2}, & |\delta| > c.
\end{cases}
$$

Another possibility is to complement the $l_2$ norm by a flat distribution, instead of the $l_1$ norm, as implemented by Andersson and Järvinen (1999) in the European Centre for Medium-Range Weather Forecasts (ECMWF) forecasting system.
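The behavior of a Huber-type penalty can be illustrated with the textbook form of the Huber function, quadratic for small departures and linear beyond a threshold. The sketch below is an assumption-laden illustration: the threshold value `c = 1.5` and the use of a normalized scalar departure are choices for this example, not the settings of any operational system.

```python
def huber(delta, c=1.5):
    """Textbook Huber penalty: quadratic for |delta| <= c, linear beyond."""
    a = abs(delta)
    if a <= c:
        return 0.5 * delta * delta
    return c * a - 0.5 * c * c


def huber_gradient(delta, c=1.5):
    """Derivative of the Huber penalty: the departure clipped to [-c, c]."""
    return max(-c, min(c, delta))


for delta in (0.5, 1.5, 3.0, 6.0):
    print(
        f"delta = {delta}: quadratic = {0.5 * delta**2:.3f}, "
        f"huber = {huber(delta):.3f}, gradient = {huber_gradient(delta):.2f}"
    )
```

Because the gradient saturates at $\pm c$, an observation with a large departure still pulls the analysis, but with a bounded influence, instead of being rejected outright or dominating the cost function.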

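For the sake of clarity, the time index $k$ is dropped here. Rather than the additive-noise form $y_i = H_i(\mathbf{x}) + \upsilon_i$, the observation equation could become $y_i = s_i H_i(\mathbf{x})$, with $s_i$ a strictly positive multiplicative dimensionless factor. In that case, $s_i$ is a relative error.

The vector $\mathbf{s}$ may obey a lognormal distribution. Lognormal error statistics are consistent with positive observables, and emerge quite naturally in the modeling of trace constituents and of their emissions, as mentioned earlier. From experimental or empirical modeling, one may have access to the first moments $\bar{\mathbf{s}} = \mathrm{E}[\mathbf{s}]$ and to the second-order moments of $\mathbf{s}$, which specify the distribution of $\mathbf{s}$ completely. As we shall see in section 6, it may be preferable to perform an analysis in a space where random variables are Gaussian. That is why it is convenient to use the Gaussian statistics of the vector $\ln(\mathbf{s})$ (the logarithm applied component-wise to the vector $\mathbf{s}$) instead of those of $\mathbf{s}$:

$$
\ln(\mathbf{s}) \sim \mathcal{N}(\mathbf{b}, \mathsf{R}_{\ln}), \qquad
b_i = \ln \bar{s}_i - \tfrac{1}{2}\,[\mathsf{R}_{\ln}]_{ii}, \tag{16}
$$

where $\mathsf{R}_{\ln}$ denotes the covariance matrix of $\ln(\mathbf{s})$. The likelihood then reads

$$
p(\mathbf{y}|\mathbf{x}) \propto \exp\left\{ -\tfrac{1}{2} \left[\ln \mathbf{y} - \ln H(\mathbf{x}) - \mathbf{b}\right]^{\mathrm{T}} \mathsf{R}_{\ln}^{-1} \left[\ln \mathbf{y} - \ln H(\mathbf{x}) - \mathbf{b}\right] \right\}. \tag{17}
$$

If the observations are unbiased, $\mathrm{E}[\mathbf{y} - H(\mathbf{x})] = \mathbf{0}$, then $\bar{s}_i = 1$. Then there is a bias [given by Eq. (16)] that needs to be removed in Gaussian space, which is accounted for in the likelihood Eq. (17). If the median of the observation error is null, then the bias is zero: $\mathbf{b} = \mathbf{0}$.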

Another option is to use the $l_\infty$ norm as an error prior penalty. In the context of atmospheric dispersion, this made it possible to check rigorously whether a transport model was, or was not, compatible with a set of observations, with analyzed errors lying in the predefined interval $[\upsilon^{-}, \upsilon^{+}]$.

Now that possible sources of non-Gaussianity have been recalled, classic solutions to deal with them are reviewed.

### c. 4D-Var solutions to deal with nonlinearity

The 4D-Var algorithm was originally proposed to solve the data assimilation problem, described as a constrained optimization problem, using classic descent algorithms. The gradient of the cost function to be minimized can be efficiently computed by optimal control techniques (Le Dimet and Talagrand 1986; Talagrand and Courtier 1987; Courtier and Talagrand 1987).

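Assume that the prior pdf of $\mathbf{x}_0$ is given in terms of departure from some known background $\mathbf{x}_b$ [i.e., $p(\mathbf{x}_0) = p_B(\mathbf{x}_0 - \mathbf{x}_b)$] and that the departure vector $\mathbf{x}_0 - \mathbf{x}_b$, the model error $\mathbf{w}_k$, and the observation error $\boldsymbol{\upsilon}_k$ are independent Gaussian vectors with zero mean and covariance matrices $\mathsf{B}$, $\mathsf{Q}_k$, and $\mathsf{R}_k$, respectively. Maximizing the conditional pdf $p(\mathsf{X}_K|\mathsf{Y}_K)$ (for a maximum likelihood estimate of $\mathbf{x}_0$) amounts to a minimization of the weakly constrained 4D-Var cost function:

$$
\begin{aligned}
J ={}& \tfrac{1}{2} (\mathbf{x}_0 - \mathbf{x}_b)^{\mathrm{T}} \mathsf{B}^{-1} (\mathbf{x}_0 - \mathbf{x}_b)
+ \tfrac{1}{2} \sum_{k=1}^{K} \left[\mathbf{y}_k - H_k(\mathbf{x}_k)\right]^{\mathrm{T}} \mathsf{R}_k^{-1} \left[\mathbf{y}_k - H_k(\mathbf{x}_k)\right] \\
&+ \tfrac{1}{2} \sum_{k=1}^{K} \left[\mathbf{x}_k - M_k(\mathbf{x}_{k-1})\right]^{\mathrm{T}} \mathsf{Q}_k^{-1} \left[\mathbf{x}_k - M_k(\mathbf{x}_{k-1})\right].
\end{aligned} \tag{19}
$$

4D-Var data assimilation algorithms are used to estimate the initial condition in state-of-the-art operational centers (e.g., for meteorology see Rabier et al. 2000), given the approach’s ability to provide flow estimates consistent with the flow evolution and the asynchronous nature of the observations.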
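For a scalar state, a weak-constraint cost function of this kind reduces to a sum of squared, variance-weighted misfits. The sketch below is a minimal illustration under stated assumptions: scalar variances `b_var`, `q_var`, and `r_var` replace the covariance matrices, and the model and observation operator passed as arguments are hypothetical.

```python
def weak_constraint_cost(trajectory, observations, x_b, b_var, q_var, r_var, model, obs_op):
    """Scalar weak-constraint 4D-Var cost for a trajectory x_0..x_K and observations y_1..y_K."""
    cost = 0.5 * (trajectory[0] - x_b) ** 2 / b_var  # background term
    for k in range(1, len(trajectory)):
        obs_misfit = observations[k - 1] - obs_op(trajectory[k])
        model_misfit = trajectory[k] - model(trajectory[k - 1])
        cost += 0.5 * obs_misfit ** 2 / r_var    # observation term
        cost += 0.5 * model_misfit ** 2 / q_var  # model error term
    return cost


def damping(x):
    """Hypothetical linear model."""
    return 0.5 * x


def identity(x):
    """Hypothetical observation operator."""
    return x


consistent = weak_constraint_cost(
    [2.0, 1.0, 0.5], [1.0, 0.5], x_b=2.0,
    b_var=1.0, q_var=0.1, r_var=0.25, model=damping, obs_op=identity,
)
inconsistent = weak_constraint_cost(
    [2.0, 1.2, 0.5], [1.0, 0.5], x_b=2.0,
    b_var=1.0, q_var=0.1, r_var=0.25, model=damping, obs_op=identity,
)
print(consistent, inconsistent)
```

A trajectory that matches the background, the model, and the observations exactly has zero cost. Letting `q_var` tend to zero forces the trajectory to follow the model, which corresponds to the strong-constraint limit.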

When the model and the observation operators are linear or weakly nonlinear (in the case of small time steps of model simulation and of short assimilation time intervals), these Gaussian assumptions may hold reasonably well. However, when the assimilation windows are long enough or the models are strongly nonlinear, the Gaussian assumptions will certainly break down and the conditional pdf could become multimodal. As a result, the maximum likelihood estimates become less informative (Lorenc and Payne 2007). This is equivalent to the existence of multiple minima of the 4D-Var cost function (Gauthier 1992; Miller et al. 1994; Pires et al. 1996), while the deterministic numerical optimization seeks only one relative minimum.

To find the global minimum of

One feasible remedy is to deal with the original nonlinear optimization problem approximately by a succession of inner-loop quadratic optimization problems, in which the model is simplified (at a lower resolution with simpler physics) and linearized (Laroche and Gauthier 1998). The input of one inner-loop iteration is generated by relinearizing the original nonlinear model around the state adjusted by the output of the previous inner-loop iteration. This inner/outer approach may fail when there exist significant inner-loop linearization errors for high-resolution models and longer assimilation windows in a context of perfect models (strong-constraint 4D-Var; Trémolet 2004). This difficulty can be alleviated by introducing model errors (weak-constraint 4D-Var). Indeed, the propagation of information within the assimilation window with the tangent linear model is shortened as compared to the strong constraint 4D-Var, thanks to model error present at each time step (Andersson et al. 2005; Fisher et al. 2005; Trémolet 2006).

Another approach, the quasi-static variational assimilation, was proposed by Pires et al. (1996) for the assimilation of dense observations. The global minimum was guaranteed by progressively lengthening the assimilation periods, thus always keeping control of the first guesses when using a gradient descent method in the cost function minimization.

### d. EnKF solutions to deal with nonlinearity

When an observation $\mathbf{y}_k$ at time $t_k$ is available, an estimate (analysis) $\mathbf{x}_k^a$ can be obtained by optimally combining some a priori state $\mathbf{x}_k^f$ and $\mathbf{y}_k$. A Best Linear Unbiased Estimator (BLUE) analysis reads

$$
\begin{aligned}
\mathbf{x}_k^a &= \mathbf{x}_k^f + \mathsf{K}_k \left[\mathbf{y}_k - H_k(\mathbf{x}_k^f)\right], \\
\mathsf{K}_k &= \mathsf{P}_k^f \mathsf{H}_k^{\mathrm{T}} \left(\mathsf{H}_k \mathsf{P}_k^f \mathsf{H}_k^{\mathrm{T}} + \mathsf{R}_k\right)^{-1}, \\
\mathsf{P}_k^a &= \left(\mathsf{I} - \mathsf{K}_k \mathsf{H}_k\right) \mathsf{P}_k^f,
\end{aligned} \tag{20}
$$

where $\mathsf{P}_k^f = \mathrm{E}[(\mathbf{x}_k - \mathbf{x}_k^f)(\mathbf{x}_k - \mathbf{x}_k^f)^{\mathrm{T}}]$ is the a priori error covariance matrix, $\mathsf{P}_k^a = \mathrm{E}[(\mathbf{x}_k - \mathbf{x}_k^a)(\mathbf{x}_k - \mathbf{x}_k^a)^{\mathrm{T}}]$ the analysis error covariance matrix, and $\mathsf{K}_k$ is the gain matrix. The form of $\mathsf{K}_k$ requires a linear approximation $\mathsf{H}_k$ of the observation operator $H_k$ around $\mathbf{x}_k^f$. This analysis $\mathbf{x}_k^a$ is optimal (among all possible linear forms) in the sense that the total analysis error variance $\mathrm{Tr}(\mathsf{P}_k^a)$ is minimized. When the a priori and observation errors are assumed to be Gaussian, the analysis $\mathbf{x}_k^a$ in Eq. (20) is also the maximum likelihood estimate of the model state that minimizes the 3D-Var cost function [$K = 1$ in Eq. (19)].

To close the algorithm, the a priori state $\mathbf{x}_{k+1}^f$ and error covariance $\mathbf{P}_{k+1}^f$ at time $t_{k+1}$ can be chosen to be the forecasts starting from the analyzed state $\mathbf{x}_k^a$ and error covariance $\mathbf{P}_k^a$. When the dynamical model is linear, these forecast and analysis formulas are the Kalman filter equations. For nonlinear models, the extended Kalman filter approximates the evolution of the error covariance using tangent linear approximations of the model equation around $\mathbf{x}_k^a$. The ensemble Kalman filter (Evensen 1994) uses an ensemble of state vector samples $\{\mathbf{x}_k^i,\ i = 1, \ldots, M\}$ to approximate the error covariances $\mathbf{P}_k^f$ and $\mathbf{P}_k^a$. This lessens the instability of the covariance evolution equation caused by the truncation errors when linearizing models for strongly nonlinear systems. With a small ensemble [e.g., $O(100)$ members], the EnKF is feasible for large geophysical applications. More algorithmic details can be found in Evensen (2003) and Houtekamer and Mitchell (2005).
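As an illustration of how the ensemble replaces the explicit covariance matrices, here is a minimal sketch of one EnKF analysis step with perturbed observations. The function name `enkf_analysis`, the three-variable state, and all numbers are hypothetical; a linear observation operator and NumPy are assumed.

```python
import numpy as np


def enkf_analysis(X, y, H, R, rng):
    """Stochastic EnKF analysis; X has shape (n, M), one column per member."""
    M = X.shape[1]
    A = X - X.mean(axis=1, keepdims=True)            # ensemble anomalies
    Pf = A @ A.T / (M - 1)                           # sample forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)   # ensemble-based gain
    # Each member assimilates its own perturbed copy of the observation.
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=M).T
    Y = y[:, None] + eps
    return X + K @ (Y - H @ X)


rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]])
X = mix @ rng.standard_normal((3, 200))              # correlated forecast ensemble
H = np.array([[1.0, 0.0, 0.0]])
R = np.array([[0.25]])
y = np.array([1.0])

Xa = enkf_analysis(X, y, H, R, rng)
```

After the update, the ensemble mean of the observed variable moves toward the observation and its spread decreases, without the covariance matrices ever being propagated explicitly.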

Reports about the EnKF can be found in meteorology, oceanography, hydrology, and several other fields (Evensen 2003). The reasons for this popularity are manifold. First, although many environmental systems are nonlinear and high dimensional, there exists a low-dimensional subspace (local and global attractors) that represents the complete dynamics reasonably well (Lions et al. 1997; Patil et al. 2001). Thus, the pdf may be represented by a proper ensemble with a limited number of members. Second, it is well known that an ensemble forecast has an advantage over a single control forecast (Leith 1974). Finally, the Gaussian assumption at analysis time may be suitable for many scenarios (e.g., the Gaussian background errors for global numerical weather predictions as in Andersson et al. 2005).

For large systems in meteorology, the most effective EnKF schemes are those that localize the background error covariance (Houtekamer and Mitchell 2001; Hamill et al. 2001) so that spurious correlations at long distance are reduced. In other fields such as oceanography and air quality a related approach using reduced-rank Kalman filters (Cane et al. 1996; Heemink et al. 2001; Pham et al. 1998) has also been tested. Such filters work only in subspaces of the complete error space (Lermusiaux and Robinson 1999; Nerger et al. 2005). The EnKF can also be viewed as a reduced-rank Kalman filter, since the error covariance matrices are approximated by the ensemble statistics in a square root form (Tippett et al. 2003). The analysis in Eq. (20) has two implementations: a deterministic scheme (Whitaker and Hamill 2002) or with perturbations of observations for consistent error statistics (Burgers et al. 1998). They differ from each other in handling non-Gaussianity (Lawson and Hansen 2004).
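The effect of covariance localization can be sketched with a Schur (elementwise) product between the sample covariance and a distance-dependent taper. This is only an illustration: the function `localize`, the Gaussian-shaped taper, and the length scale are assumptions made for brevity, whereas compactly supported correlation functions are common in practice.

```python
import numpy as np


def localize(P_sample, length_scale):
    """Schur product of a sample covariance with a Gaussian-shaped taper
    that depends on the distance between grid indices (assumed 1D grid)."""
    n = P_sample.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    taper = np.exp(-0.5 * (dist / length_scale) ** 2)
    return taper * P_sample, dist


rng = np.random.default_rng(1)
n, M = 40, 10
# Uncorrelated truth: every off-diagonal sample covariance is spurious.
X = rng.standard_normal((n, M))
A = X - X.mean(axis=1, keepdims=True)
P_sample = A @ A.T / (M - 1)

L = 4.0
P_loc, dist = localize(P_sample, L)
far = dist > 3 * L                      # pairs of distant grid points
```

With only ten members, the sample covariance between distant points is far from zero although the true covariance vanishes; the taper leaves the variances untouched and damps these long-distance terms.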

Improvements to the EnKF are essentially driven by the design of better sampling strategies for the ensemble generation: the second-order exact resampling (Pham 2001), the unscented sampling (Van der Merwe et al. 2000), and the mean-preserving sampling (Sakov and Oke 2008). Increasingly, model deficiencies are simulated using ensemble members generated with different versions of the underlying forecast model (e.g., with different physical parameterizations; Meng and Zhang 2007; Fujita et al. 2007; Houtekamer et al. 2009; or with perturbations of model parameters; Wu et al. 2008).

Another idea is to bridge the gap between variational and sequential approaches, and to improve the EnKF performance (Kalnay et al. 2007) using ideas and techniques developed for 4D-Var. Such attempts are for example: the inner–outer loop to deal with nonlinearities (Kalnay et al. 2007), the variational formulation to treat the non-Gaussian error structure in observation (Zupanski 2005) and background (Harlim and Hunt 2007), and the time interpolation of the background forecasts to the observations so as to produce time-coherent assimilations of all the observations available within the assimilation window (Hunt et al. 2004; Houtekamer and Mitchell 2005). Alternatively, the 3D-Var or 4D-Var can use flow-dependent error covariances computed from the EnKF ensembles (Buehner et al. 2010), which leads to more efficient hybrid algorithms.

## 4. Measuring non-Gaussianity

In a Gaussian framework, one needs tools to assess the deviation from Gaussianity mainly induced by nonlinearities of the model: objective mathematical measures or statistical tests. These tools will be reviewed in this section, and, moreover, some of them will be used in section 6.

### a. Relative entropy

A measure of the discrepancy between two pdfs $p$ and $q$ is given by the Kullback–Leibler divergence or relative entropy (Kullback 1959):

$$\mathcal{K}(p, q) = \int p(\mathbf{x}) \ln\frac{p(\mathbf{x})}{q(\mathbf{x})}\,\mathrm{d}\mathbf{x}. \tag{23}$$

Coming from signal and information theory, it quantifies the information gain from $q$ to $p$. It has (axiomatic) properties that make it very attractive (Cover and Thomas 1991). First of all, it is always nonnegative. It is null if and only if $p$ is equal to $q$ almost everywhere. It is also convex with respect to both $p$ and $q$. Moreover, it is invariant by any one-to-one reparameterization $\mathbf{x} = \Xi(\boldsymbol{\theta})$. However, it is not a distance in the mathematical sense since it is not symmetric with respect to $p$ and $q$, nor does it obey the triangle inequality.
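The nonnegativity and the lack of symmetry of the relative entropy can be checked on a small discrete example. The function `relative_entropy` and the two three-state distributions below are hypothetical and chosen only for illustration.

```python
import math


def relative_entropy(p, q):
    """Discrete relative entropy: sum_i p_i ln(p_i / q_i).

    Assumes p and q are normalized and q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

k_pq = relative_entropy(p, q)   # information gain from q to p
k_qp = relative_entropy(q, p)   # information gain from p to q
k_pp = relative_entropy(p, p)   # vanishes when both pdfs coincide
```

Both `k_pq` and `k_qp` are positive but differ from each other, which is why the relative entropy cannot be a distance in the mathematical sense.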

The measure has been used in geophysical, high-dimensional applications, in predictability (Kleeman 2002), in the statistical modeling of geophysical dynamical systems (Haven et al. 2005), in inverse modeling (Bocquet 2005b,c), and in the modeling of prior pdfs (Eyink and Kim 2006; Pires et al. 2010).

The Kullback–Leibler divergence can serve as an objective function to measure deviation from Gaussianity. If $p$ is the full pdf of the uncertainty for the system, and $q \equiv p_G$ is the Gaussian pdf that has the same first- and second-order moments, then the divergence $\mathcal{K}(p, p_G)$, often called *negentropy*, is a measure of the non-Gaussianity of $p$. Obviously, if $p$ is Gaussian, the divergence is null; it is positive otherwise.
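A short worked equation may clarify why this divergence is called negentropy. The notation $S(p) = -\int p \ln p\,\mathrm{d}\mathbf{x}$ for the differential entropy is introduced here only for this illustration, and the entropies are assumed to be finite. Because $\ln p_G$ is a quadratic function of $\mathbf{x}$, its average depends only on the first- and second-order moments, which $p$ and $p_G$ share:

$$\mathcal{K}(p, p_G) = \int p \ln p\,\mathrm{d}\mathbf{x} - \int p \ln p_G\,\mathrm{d}\mathbf{x} = -S(p) - \int p_G \ln p_G\,\mathrm{d}\mathbf{x} = S(p_G) - S(p) \ge 0.$$

The negentropy is therefore the entropy deficit of $p$ with respect to the Gaussian pdf with the same mean and covariance, the latter having the largest entropy among all pdfs with these moments.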

The pdf *p* could be estimated approximately by the use of an ensemble, such as the one used by ensemble-based filters. It is, however, difficult to perform such estimation for high-dimensional systems, especially with a small ensemble. Besides, the presence of a strange attractor complicates the numerical convergence and a proper definition of a continuous limit. Several solutions have been proposed to overcome the difficulty: compute relative entropies of marginals of *p* or compute expansions of the relative entropy.

With the first approach, a lower bound of the relative entropy $\mathcal{K}(p, p_G)$, with a pdf $p$ depending on $N$ variables, can be obtained. For each subset $s$ of $n \le N$ variables, one integrates out the pdfs $p$ and $q$ on this subset of variables and obtains a divergence between marginals, which cannot exceed $\mathcal{K}(p, q)$. The second approach relies on the Gram–Charlier or Edgeworth expansions of the ratio $p/q$ (e.g., see Barndorff-Nielsen and Cox 1989). These expansions depend on skewness and kurtosis, which is consistent with their use in the diagnosis of non-Gaussianity. They are expressed in terms of cumulants of the distribution. They both represent the same expansion, but the terms are ordered differently. In the Gram–Charlier expansion, the ordering index is the cumulant order, while in Edgeworth, the ordering index is the size $M$ of the ensemble, which samples the distribution. The latter ordering makes the Edgeworth expansion more controlled, though less simple. Using these expansions, the relative entropy can be approximated, to leading order, by

$$\mathcal{K}(p, p_G) \simeq \frac{1}{12}\sum_{i,j,k}\left(\kappa_{i,j,k}\right)^2 + \frac{1}{48}\sum_{i,j,k,l}\left(\kappa_{i,j,k,l}\right)^2 + \cdots,$$

where $\kappa_{i_1,i_2,\ldots,i_n}$ are the *standardized* cumulants of $p$ of order $n$.
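For a single variable, the leading terms of such a cumulant expansion reduce to the squared skewness divided by 12 plus the squared excess kurtosis divided by 48. The sketch below estimates this quantity from samples; the function names and sample sizes are hypothetical, and the truncation is an assumption that is only reasonable close to Gaussianity.

```python
import random


def standardized_cumulants(sample):
    """Sample skewness and excess kurtosis (third and fourth standardized cumulants)."""
    n = len(sample)
    mean = sum(sample) / n
    std = (sum((x - mean) ** 2 for x in sample) / n) ** 0.5
    k3 = sum(((x - mean) / std) ** 3 for x in sample) / n
    k4 = sum(((x - mean) / std) ** 4 for x in sample) / n - 3.0
    return k3, k4


def negentropy_estimate(sample):
    """Leading-order univariate approximation: k3^2 / 12 + k4^2 / 48."""
    k3, k4 = standardized_cumulants(sample)
    return k3 ** 2 / 12.0 + k4 ** 2 / 48.0


rng = random.Random(0)
gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(20000)]

j_gaussian = negentropy_estimate(gaussian_sample)
j_skewed = negentropy_estimate(skewed_sample)
```

The estimate stays close to zero for the Gaussian sample and is clearly positive for the exponential one. For such a strongly non-Gaussian law, however, the truncated expansion should not be expected to return the correct magnitude of the negentropy.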

Figures 3 and 4 illustrate the departure from Gaussianity and ways to measure it on a deterministic Lorenz-63 model (Lorenz 1963), where the negentropy can be estimated numerically. A Gaussian pdf sampled by particles, initially of covariance matrix 𝗣 = diag(*σ*^{2}, *σ*^{2}, *σ*^{2}), with *σ* = 0.20, is transformed under the model flow. The full negentropy is estimated via a numerical integration. As explained by Kleeman (2002), relative entropy must be estimated at fixed resolution, possibly fine enough to encapsulate the attractor [a spacing of about *r* = Δ*x* = Δ*y* = Δ*z* ≃ 0.1 has been chosen to discretize the integral Eq. (23)]. Edgeworth expansion, and its estimates based on one- and two-variable marginals are estimated as well. The result of the Edgeworth expansion cannot be directly compared to the full relative entropy estimate of finite resolution. Indeed the ensemble members tend to gather close to, or on, the attractor, which makes their distribution more and more singular with the flow’s evolution. For the sake of comparison, the ensemble must therefore be smoothed out by a normal law (of variance 0.5 here) yielding a finite resolution value. Obviously after time *t* = 0.5, the cluster of particles, stretched by the flow, loses its cohesion, and the pdf becomes significantly non-Gaussian. This is confirmed by the indicators based on the negentropy, as well as the Edgeworth expansion.

The deviation from Gaussianity is reported by each one of these indicators, though not with the correct magnitude. Yet, as numerical estimations of the Kullback–Leibler divergence, all these approximations are unsatisfactory.

### b. Univariate and multivariate tests of normality

Because such computations cannot easily be generalized to high-dimensional, complex dynamical systems, one could rely on simpler necessary tests of normality. Hypothesis testing is a well-developed topic in statistics, and many techniques meant to test the Gaussianity of random variables exist.

Among the many tests of normality available in the statistical literature, the skewness and kurtosis, which are directly defined by the cumulants of a distribution, were among the earliest to be used. Lawson and Hansen (2004) have used them to assess how differently stochastic and deterministic ensemble-based filters handle non-Gaussianity. There are many other tests such as the Kolmogorov–Smirnov test (Lilliefors 1967), the Anderson–Darling test (Anderson and Darling 1952), the Shapiro–Wilk test (Shapiro and Wilk 1965), and their many variants.

A few generalizations of these tests to multivariate statistics do exist, but are not meant to handle system sizes as big as those in geophysics. One is therefore compelled to use necessary though insufficient tests. Examples have been given earlier with the use of marginal distributions in the computation of negentropy. One could also rely on combinations of variables in the systems, such as sums (or sums of squares) of individual degrees of freedom, which are supposed to be close to Gaussian. Assuming a Gaussian distribution for the elementary degrees of freedom, it may be possible to compute the distribution of these consolidated random variables. Then univariate null-hypothesis statistical tests, such as those mentioned earlier can be used in turn.

For instance, the sum of squares of independent standard Gaussian variables follows a $\chi^2$ distribution. Bengtsson et al. (2003) used such a test to obtain a measure of the deviation from normality. They considered a subset of three adjacent variables $x_1$, $x_2$, and $x_3$ in the Lorenz-95 model, so as to reduce the number of degrees of freedom to handle. An ensemble (drawn from an ensemble-based assimilation technique) represents the uncertainty in the system. If $\boldsymbol{\Sigma}$ is the covariance matrix of this ensemble, defined on the subset, and if $\mathbf{z}_i$ is the deviation of the $i$th member from the mean (restricted to the subset), then a scalar random variable that combines the three degrees of freedom is

$$\mathbf{z}_i^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}_i.$$

It should follow a $\chi^2$ distribution with three degrees of freedom. The authors tested this hypothesis using the Kolmogorov–Smirnov test. This way they showed that the forecast produced by an EnKF exhibits significantly non-Gaussian features, at least for long intervals between observation times (cf. $\Delta t = 0.4$, whereas $\Delta t = 0.05$ in this review).
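The test described above can be sketched as follows. This is not the implementation of Bengtsson et al. (2003): the function names, the synthetic ensembles, and the ensemble size are hypothetical. Only the Kolmogorov–Smirnov statistic is computed, since the usual significance levels of the test are not exact when the mean and covariance are estimated from the same sample.

```python
import math
import numpy as np


def chi2_3_cdf(d):
    """Closed-form cdf of the chi-square law with three degrees of freedom."""
    return (math.erf(math.sqrt(d / 2.0))
            - math.sqrt(2.0 * d / math.pi) * math.exp(-d / 2.0))


def ks_statistic_chi2_3(X):
    """Kolmogorov-Smirnov distance between the quadratic forms z^T Sigma^-1 z
    of an ensemble X of shape (M, 3) and the chi-square law with 3 dof."""
    Z = X - X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sort(np.einsum("ij,jk,ik->i", Z, Sigma_inv, Z))
    M = len(d)
    F = np.array([chi2_3_cdf(v) for v in d])
    upper = np.arange(1, M + 1) / M - F     # empirical cdf above the model cdf
    lower = F - np.arange(0, M) / M         # empirical cdf below the model cdf
    return max(upper.max(), lower.max())


rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.2, 0.5, 1.0]])
gaussian_ens = rng.standard_normal((2000, 3)) @ mix.T   # correlated Gaussian ensemble
cubed_ens = gaussian_ens ** 3                           # strongly non-Gaussian ensemble

ks_gaussian = ks_statistic_chi2_3(gaussian_ens)
ks_cubed = ks_statistic_chi2_3(cubed_ens)
```

The statistic remains small for the Gaussian ensemble, whatever its correlations, and becomes large for the non-Gaussian one.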

## 5. Reducing nonlinearity’s impact: Divide and conquer

With a denser monitoring network or more frequent observations, the model should remain closer to its tangent linear trajectory between analyses. Therefore, non-Gaussianity will not develop as much as in a system less constrained by observations. Nonetheless, if, with an increasing number of observations, the model resolution is increased as well and subgrid processes are explicitly represented at a finer scale, new sources of nonlinearity and non-Gaussianity might appear, as discussed in section 3. In the context of meteorology, the finer the horizontal space scale, the bigger the error growth rate (Lorenz 1969; Tribbia and Baumhefner 2004; Lorenc and Payne 2007). As a consequence, how non-Gaussian the errors are may well depend on their scale. Increasing both the space and time resolution and the observation density may lead to a data assimilation system with significantly non-Gaussian error statistics at the convective scale, while synoptic-scale errors become smaller and more Gaussian.

In this section, following this paradigm but without going as far as adopting a broad multiscale view on non-Gaussianity, we review some of the ideas put forward to reduce non-Gaussianity and nonlinearity, so that classic data assimilation based on Gaussian hypotheses could become more efficient.

One idea consists in using targeted (also called adaptive) observations in order to obtain a better control. A second one consists in dividing the system between degrees of freedom that are more or less prone to nonlinearities, and hence require more or less accounting for non-Gaussian effects. Another idea consists in representing non-Gaussian features, such as multimodality, via a sum of individual Gaussian components.

### a. Better control with adaptive observations

The analysis improvement and the reduction of its computational cost can be obtained by an assimilation that adapts to the properties of the dynamical flow, in particular its instability. For instance, Pires et al. (1996) have shown that the efficient variational assimilation length *τ*_{eff}(**x**) is proportional to *λ*^{−1}(**x**), where *λ*(**x**) is the leading local Lyapunov exponent at **x**. From Eq. (3.15) of Pires et al. (1996), and relying on their simplifying assumptions [i.e., perfect model, frequent and regular observations within the assimilation window, and *λ*(**x**)Δ*t*(**x**) ≪ 1], the leading analysis error variance *e*^{2}(**x**) is constrained by *e*^{2}(**x**) ≤ 2*λ*(**x**)Δ*t*(**x**)*σ*^{2}, where observations of variance *σ*^{2} are obtained each Δ*t*(**x**) (much shorter than the assimilation window). Thus, smaller observation intervals are required for cases that are more unstable.

Adaptive techniques in data assimilation also call for the deployment of targeted observations (TOs), pioneered by the singular vectors approach (Buizza and Montani 1999). Then, Daescu and Navon (2004) use the adjoint sensitivity approach, evaluating the sensitivity function ‖**∇***J*‖, the norm of the gradient of a forecasting error functional *J*. Local maxima of this sensitivity function in the physical space determine the set of locations of the TOs.

The number of tracking observations, necessary for the stabilization of a sequential prediction-assimilation system and tracking unstable flow, depends on the number and magnitude of the system’s *m* positive Lyapunov exponents (Carrassi et al. 2008). The efficient monitoring of the *m*-dimensional unstable space E is achieved through the blending of a fixed observational network with updated TOs. Efficient analyses with fixed observations are obtained through the assimilation in the unstable subspace (Carrassi et al. 2007) where the analysis increment is confined to the updated unstable subspace E, obtained by the method of breeding of the data assimilation system.

### b. Bayesian filtering in reduced-rank system or subsystem

Following the divide and conquer strategy, the use of exact Bayesian techniques, such as particle filters, could be restricted to the significantly non-Gaussian degrees of freedom of the geophysical system. For instance, Lagrangian assimilation of data from oceanographic drifters is highly non-Gaussian since the positions of the drifters need to be controlled too. Spiller et al. (2008) have successfully tested several particle-filtering strategies on such drifters in a flow generated by point vortices. Berliner and Wikle (2007) and Hoteit et al. (2008) explore theoretically the use of particle filters, but on identified low-dimensional manifolds of the dynamics, or on a reduced-order model of a large geophysical system [e.g., through empirical orthogonal functions (EOFs)].

### c. Localizing strategies for particle filters

In the context of the fully Bayesian estimation problem, non-Gaussian uncertainty could be reduced by localization of the analyses. Indeed the smaller the area, the smaller the number of degrees of freedom to handle, the less complex (e.g., multimodal) the local pdf of these degrees of freedom should be. As a consequence, the smaller the area, the lower is the necessary number of particles for a given precision estimate. However, contrary to localization in the EnKF, the analyses cannot be simply glued together to get a complete set of updated global particles. That is why localization was not used on the particle filter in the illustrations of section 2. This issue has been largely discussed by van Leeuwen (2009).

### d. Gaussian mixtures

In a Gaussian mixture model, the pdf $p(\mathbf{x})$ is parameterized by Gaussian kernels as (Silverman 1986)

$$p(\mathbf{x}) = \frac{1}{M}\sum_{i=1}^{M} n\left(\mathbf{x};\, \mathbf{x}^i, \mathbf{P}\right),$$

where $n(\cdot\,;\, \mathbf{x}^i, \mathbf{P})$ is the pdf of the Gaussian distribution of mean $\mathbf{x}^i$ and covariance matrix $\mathbf{P}$, and the $\mathbf{x}^i$, $i = 1, \ldots, M$, are the samples.

With a view to particle filtering, one could replace each particle of the ensemble by a broader Gaussian kernel. Unfortunately, for high-dimensional systems, this kernel representation also suffers from the curse of dimensionality (Silverman 1986). Recently proposed remedies essentially consist of filtering in the low-dimensional subspace related to the attractor of the complete system, as mentioned in the previous section. This can be implemented either by a localization and smoothing procedure (Bengtsson et al. 2003) or thanks to a low-rank representation of the error covariance matrix (Hoteit et al. 2008) inherited from the reduced-rank Kalman filters. Note that in Bengtsson et al. (2003), the error covariance matrices associated with the kernels are not identical but generated by a Kalman filtering for each sample $\mathbf{x}^i$. Nevertheless, Gaussian mixture models are distinct from the unscented particle filter (Van der Merwe et al. 2000) where the sequential Monte Carlo sampling is performed according to locally linearized importance sampling functions given by the posterior pdf of Kalman filters for each of the particles (see section 6).
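Replacing each particle by a Gaussian kernel can be sketched as below. Equal weights and a single covariance matrix shared by all kernels are assumptions of this illustration, and the function `mixture_pdf` and the five-member one-dimensional ensemble are hypothetical.

```python
import numpy as np


def mixture_pdf(x, ensemble, P):
    """Equal-weight Gaussian kernel mixture evaluated at point x.

    ensemble has shape (M, d); P is the (d, d) covariance shared by all kernels.
    """
    d = ensemble.shape[1]
    P_inv = np.linalg.inv(P)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(P))
    diff = x - ensemble                                   # shape (M, d)
    quad = np.einsum("ij,jk,ik->i", diff, P_inv, diff)    # one quadratic form per kernel
    return np.mean(np.exp(-0.5 * quad)) / norm


# Hypothetical bimodal ensemble in one dimension.
ensemble = np.array([[-2.1], [-1.9], [-2.0], [1.8], [2.2]])
P = np.array([[0.25]])

pdf_between_modes = mixture_pdf(np.array([0.0]), ensemble, P)
pdf_on_mode = mixture_pdf(np.array([2.0]), ensemble, P)
```

The mixture keeps the bimodality of the ensemble, which a single Gaussian fitted to the same five members could not represent.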

## 6. Bridging the gap between Gaussian and non-Gaussian data assimilation

There have been recent attempts to make use of non-Gaussian ideas in geophysical (or geophysically inspired) data assimilation. They remain quite specific in their application, because of their underlying hypotheses. They are, nevertheless, promising and a discussion on their relevance to geophysical data assimilation is presented. Contrary to section 3 where non-Gaussian errors were described and modeled mathematically, the emphasis here is on producing the analyses that cope with those non-Gaussian errors.

### a. Statistical expansion about the climatology

In an ensemble-based filtering system, the estimates of the first- and second-order moments rely on the ensemble itself. Instead of forming a Gaussian pdf as a prior for the analysis using these statistics, one computes the pdf that is *closest* to the climatology and that has the same first- and second-order moments, as shown by Eyink and Kim (2006).

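A *distance* between a pdf $p$ and a climatology $q$ is provided by the relative entropy in Eq. (23). Let us assume that the observation operator $\mathbf{H}$ is linear. The minimization of Eq. (23) with respect to $p$, under the constraints of the first- and second-order statistics:

$$\sum_{\mathbf{x}} p(\mathbf{x})\,\mathbf{x} = \frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_i, \qquad \sum_{\mathbf{x}} p(\mathbf{x})\,\mathbf{x}\mathbf{x}^{\mathrm{T}} = \frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_i\mathbf{x}_i^{\mathrm{T}},$$

where $M$ is the ensemble size, yields the generic exponential solution:

$$p(\mathbf{x}) = \frac{q(\mathbf{x})}{Z(\boldsymbol{\lambda}, \boldsymbol{\Lambda})}\exp\left(\boldsymbol{\lambda}^{\mathrm{T}}\mathbf{x} - \tfrac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Lambda}\mathbf{x}\right),$$

where the vector $\boldsymbol{\lambda} \in \mathbb{R}^d$ is the Lagrange parameter conjugated to the mean constraint, and the matrix $\boldsymbol{\Lambda} \in \mathbb{R}^{d\times d}$ is a Lagrange parameter matrix conjugated to the variance constraint. Next, inserting this exponential law in the relative entropy, one is led to the optimization on these dual parameters:

$$\min_{\boldsymbol{\lambda}, \boldsymbol{\Lambda}}\left[\ln Z(\boldsymbol{\lambda}, \boldsymbol{\Lambda}) - \frac{1}{M}\sum_{i=1}^{M}\left(\boldsymbol{\lambda}^{\mathrm{T}}\mathbf{x}_i - \tfrac{1}{2}\mathbf{x}_i^{\mathrm{T}}\boldsymbol{\Lambda}\mathbf{x}_i\right)\right],$$

where $Z(\boldsymbol{\lambda}, \boldsymbol{\Lambda})$ is the partition function

$$Z(\boldsymbol{\lambda}, \boldsymbol{\Lambda}) = \sum_{\mathbf{x}} q(\mathbf{x})\exp\left(\boldsymbol{\lambda}^{\mathrm{T}}\mathbf{x} - \tfrac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Lambda}\mathbf{x}\right),$$

with $\sum_{\mathbf{x}}$ an integration symbol that represents a sum when a discrete distribution is considered or an integral when the target space of the distribution is continuous.

When an observation becomes available ($\mathbf{y}$ with error covariance matrix $\mathbf{R}$), the pdf is updated using Bayes’s rule, within a dual framework. The dual parameters update reads as follows (in a fashion similar to the so-called *information* Kalman filter):

$$\boldsymbol{\lambda}^a = \boldsymbol{\lambda}^f + \mathbf{H}^{\mathrm{T}}\mathbf{R}^{-1}\mathbf{y}, \qquad \boldsymbol{\Lambda}^a = \boldsymbol{\Lambda}^f + \mathbf{H}^{\mathrm{T}}\mathbf{R}^{-1}\mathbf{H}.$$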
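The equivalence between Bayes's rule and an additive update of the dual parameters can be checked numerically on a one-dimensional grid. This sketch assumes a prior of the form climatology times the exponential of a quadratic function, a scalar linear observation operator `h`, and Gaussian observation errors; the bimodal climatology and all parameter values are hypothetical.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 1601)     # discretized one-dimensional state space


def normalized(w):
    return w / w.sum()


# Hypothetical bimodal climatology q.
q = normalized(np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2))


def exponential_family_pdf(lam, Lam):
    """p(x) proportional to q(x) exp(lam * x - Lam * x^2 / 2)."""
    return normalized(q * np.exp(lam * x - 0.5 * Lam * x ** 2))


lam_f, Lam_f = 0.3, 0.2              # prior dual parameters
h, r, y = 1.0, 0.5, 1.5              # observation operator, error variance, observation

# Direct application of Bayes's rule with a Gaussian likelihood.
p_f = exponential_family_pdf(lam_f, Lam_f)
p_bayes = normalized(p_f * np.exp(-0.5 * (y - h * x) ** 2 / r))

# Same posterior obtained by updating the dual parameters only.
lam_a = lam_f + h * y / r
Lam_a = Lam_f + h * h / r
p_dual = exponential_family_pdf(lam_a, Lam_a)
```

Both posteriors coincide up to round-off, because the Gaussian likelihood only adds a linear and a quadratic term to the exponent, while the climatology factor is left unchanged.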

This work has also been put forward by van Leeuwen (2009) in his review on particle filters. However, we do not consider the method to be a particle filter, since it involves the truncation of the moments to second order, and the most innovative part is the treatment of the prior. Though the idea is very appealing, it remains to be proven that an attractor of the dynamics can be described analytically or numerically so that this information can be used in the method. Eyink and Kim (2006) tested their method on the Lorenz-63 model, using a mixture of two Gaussians to describe the two lobes of the attractor. The results show that the method eventually outperforms the EnKF, but in a regime where the filter is very nonlinear. This occurs when the time interval between two analyses reaches about Δ*t* = ⅔, possibly when the climatology starts having a significant impact on the filter trajectories. This might not reflect realistic conditions, since the time interval between two analyses in weather forecasting would rather correspond to Δ*t* = 0.05.

### b. Gaussian anamorphosis

One way to treat non-Gaussianities is to attempt to transform, analytically or numerically, non-Gaussian random variables into Gaussian ones, on which a BLUE-based analysis can appropriately be carried out.

#### 1) Analytical transformation

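When the errors are lognormal, an observation model $\tilde{H}$ may be deduced from the original observation model $H$ through $\tilde{H} = \ln \circ H \circ \exp$. Symbol $\circ$ is the function composition operator. The change of state variable $\tilde{\mathbf{x}} \equiv \ln(\mathbf{x})$ enables the construction of a BLUE estimator in this *Gaussian space*:

$$\tilde{\mathbf{x}}^a = \tilde{\mathbf{x}}^b + \tilde{\mathbf{K}}\left[\ln\mathbf{y} - \mathbf{b} - \tilde{H}(\tilde{\mathbf{x}}^b)\right], \tag{35}$$

where the bias $\mathbf{b}$ has been defined by Eq. (16) of section 3, and with the usual optimal gain, though in Gaussian space:

$$\tilde{\mathbf{K}} = \tilde{\mathbf{B}}\tilde{\mathbf{H}}^{\mathrm{T}}\left(\tilde{\mathbf{H}}\tilde{\mathbf{B}}\tilde{\mathbf{H}}^{\mathrm{T}} + \tilde{\mathbf{R}}\right)^{-1}. \tag{36}$$

Here $\tilde{\mathbf{B}}$ and $\tilde{\mathbf{x}}^b$ are the background covariance matrix and first guess, respectively, in *Gaussian space*, while $\tilde{\mathbf{H}}$ is the linearization of $\tilde{H}$ and $\tilde{\mathbf{R}}$ the observation error covariance matrix in *Gaussian space*. One can pull the fields back into the original space through the exponential map and obtain the optimal estimators. One weak point of the approach is that the *Gaussian space* observation operator $\tilde{H}$ is in general nonlinear, even when $H$ is linear.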

The variational version of this analysis, essentially based on Eq. (17), was examined by Fletcher and Zupanski (2006), including thorough discussions on how to choose a proper estimator and how to precondition the minimization of a cost function such as Eq. (17). Mapping the lognormal errors to a Gaussian space is a particular (analytical) case of a *Gaussian anamorphosis*.
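A scalar sketch of an analysis carried out in the logarithm of the variables may help fix ideas. It assumes a directly observed positive variable (so that the Gaussian-space observation operator is the identity), error variances prescribed in logarithmic space, and a bias that is subtracted from the log-innovation; the function `lognormal_blue` and its arguments are hypothetical.

```python
import math


def lognormal_blue(xb, var_b_log, y, var_o_log, bias=0.0):
    """BLUE analysis on the logarithms of a positive scalar.

    var_b_log and var_o_log are the background and observation error
    variances of ln(x) and ln(y); bias is the assumed mean of the
    observation error in logarithmic space.
    """
    xb_log = math.log(xb)
    gain = var_b_log / (var_b_log + var_o_log)
    xa_log = xb_log + gain * (math.log(y) - bias - xb_log)
    var_a_log = (1.0 - gain) * var_b_log
    # Pulling back with exp yields the median of the lognormal analysis pdf.
    return math.exp(xa_log), var_a_log


xa, var_a_log = lognormal_blue(xb=2.0, var_b_log=0.25, y=8.0, var_o_log=0.25)
```

With equal error variances, the analysis is the geometric mean of the background and the observation rather than their arithmetic mean, and it remains positive by construction.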

#### 2) Numerical transformations

When an analytical transformation to Gaussian space is not possible because the errors do not necessarily follow a lognormal behavior, then numerical methods can be used to achieve a similar goal. This is called a (numerical) Gaussian anamorphosis. This technique is well known in geostatistics (Wackernagel 2003). Its use has been advocated by Bertino et al. (2003) in the context of geophysical data assimilation (see also the next section). The idea of performing the analysis in the *Gaussian space* is the same as that of Cohn (1997), but for a general, albeit numerical transformation.

Let us consider a random variable $X \sim P_X$ and a Gaussian random variable $\Gamma \sim P_\Gamma$, each with values in its own state space. If $F$ and $G$ are the cumulative distribution functions (cdfs) of $X$ and $\Gamma$, and if $F$ is invertible, then the anamorphosis function is defined by

$$\varphi = F^{-1} \circ G.$$

This deterministic map transforms a Gaussian random variable into a non-Gaussian one. The inverse mapping pulls the non-Gaussian variable back to a random Gaussian variable:

$$\varphi^{-1} = G^{-1} \circ F.$$

There is a natural numerical counterpart to this analytical construct, the so-called empirical Gaussian anamorphosis. If one has $n$ samples of $X$, $x_1 < x_2 < \cdots < x_n$ under the simplifying assumption that all $x_i$ are distinct, then the empirical anamorphosis function $\varphi_n$ is

$$\varphi_n(\gamma) = \sum_{i=1}^{n} x_i\, I_{\left]\frac{i-1}{n},\, \frac{i}{n}\right]}\left(G(\gamma)\right),$$

where $I_{]a,b]}$ is the support function on interval $]a, b]$ (i.e., $a$ excluded while $b$ is included), equal to 1 on the interval and 0 everywhere else. However, since $\varphi_n$ is a stepwise function because of the discrete data, it is not invertible and needs to be smoothed out. One proper and convenient filtering is obtained by a truncated expansion of the empirical Gaussian anamorphosis on a basis of Hermite polynomials. Details can be found in Wackernagel (2003).

In principle, a Gaussian anamorphosis is needed in both state space and observation space, so that analysis equations similar to Eqs. (35) and (36) can be applied in Gaussian space. An inverse Gaussian anamorphosis is then built to pull the analyzed fields back into the original space.
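The empirical construction can be sketched with the Python standard library. The two functions below are hypothetical: `gaussian_scores` pulls each sample back to a standard normal score through its rank (the mid-rank convention `(rank + 0.5) / n` is an assumption), and `empirical_anamorphosis` is the stepwise map from Gaussian space to the sorted samples, without the Hermite polynomial smoothing.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def gaussian_scores(sample):
    """Associate a standard normal score with each sample through its rank."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STANDARD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def empirical_anamorphosis(sorted_sample, gamma):
    """Stepwise map: returns x_i when G(gamma) lies in ](i-1)/n, i/n]."""
    n = len(sorted_sample)
    u = STANDARD_NORMAL.cdf(gamma)
    i = min(n, max(1, math.ceil(u * n)))
    return sorted_sample[i - 1]


sample = [0.1, 2.5, 0.7, 9.0, 1.3]           # hypothetical positive, skewed data
scores = gaussian_scores(sample)
recovered = [empirical_anamorphosis(sorted(sample), g) for g in scores]
```

The scores preserve the ordering of the data, are symmetric about zero, and are mapped back onto the original samples by the stepwise function.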

It has recently been implemented on a large ocean and biogeochemical model by Simon and Bertino (2009) with success in a twin experiment. The transformation was applied to a chlorophyll field. Applying this methodology to such a large-scale experiment is not simple. As a first reasonable step, the authors neglected the correlations, and considered some climatological statistical univariate distributions of the non-Gaussian variables, when the anamorphosis is well defined and simple to implement. To take into account the full correlations, one would need to consider multivariate anamorphosis. With multivariate statistics, one would have to rotate the state space to get uncorrelated variables, by principal component analysis or independent component analysis (Hyvärinen and Oja 2000), and then apply Gaussian anamorphosis to each of the marginals. However, a non-Gaussian part of mutual information (in other words, residual correlations) remains in the rotated space (Pires and Perdigão 2007).

#### 3) Humidity transform in meteorological models

Both specific humidity $q$ and relative humidity RH fields have intrinsically non-Gaussian distributions due to their finite interval supports. As a consequence, background and observational errors are also non-Gaussian by nature, also exhibiting largely inhomogeneous statistics both in latitude and height as well as presenting cross correlations with different control variables. To optimally apply a BLUE-based analysis, one has to find a proxy control humidity variable $\phi(\Phi)$ of some set $\Phi$ of the thermodynamic background variables (e.g., $q$, pressure $p$, temperature $T$), whose conditional background error $\epsilon_b|\Phi$ is at least approximately Gaussian. Hólm et al. (2002) have proposed several control variables $\phi(\Phi)$ from the distribution of the corresponding forecast differences $\delta\phi$, extracted from observing system simulation experiments (OSSEs) performed with the ECMWF data assimilation system. Since $\delta\phi$ is a difference between background errors, the conditional pdf $p_{\delta\phi}(\delta\phi|\Phi)$ is the convolution of the conditional pdf $p_b(\epsilon_b|\Phi)$ of background errors $\epsilon_b$:

$$p_{\delta\phi}(\delta\phi|\Phi) = \int p_b(\epsilon_b|\Phi)\, p_b(\epsilon_b - \delta\phi|\Phi)\,\mathrm{d}\epsilon_b.$$

Therefore, Gaussian forecast differences $\delta\phi$ lead to a Gaussian $\epsilon_b$ of variance $\mathrm{var}(\epsilon_b) = \tfrac{1}{2}\mathrm{var}(\delta\phi)$ and a quadratic background log-likelihood function $J_b(\epsilon_b)$ with a single minimum and thus a simpler procedure of minimization. Given those advantages, one then aims to get at least quasi-Gaussian $\delta\phi$ values. For $\delta\phi$ equal to $\delta q$ or $\delta\ln(q)$, one obtains approximately exponential distributions, whereas $\delta\mathrm{RH}$ is more closely Gaussian (Hólm et al. 2002). To get collected Gaussian statistics of $\delta\phi$ for all grid points together and $\Phi$ states, it is preferable to use the normalized forecast difference:

$$\widetilde{\delta\phi} = \frac{\delta\phi - b(\delta\phi|\Phi)}{\sigma(\delta\phi|\Phi)},$$

where $b(\delta\phi|\Phi)$ and $\sigma(\delta\phi|\Phi)$ are, respectively, the bias and standard deviation of $\delta\phi$ conditioned on $\Phi$. Thanks to the homoscedasticity (uniform standard deviation) of $\widetilde{\delta\phi}$, its statistics can be pooled over all conditions $\Phi$. A further step is a *Gaussianization* procedure where one applies the (inverse) anamorphosis defined in the previous section to the forecast differences for all conditions $\Phi$ by the transform:

$$f(\delta\phi|\Phi) = G^{-1}\left[F(\delta\phi|\Phi)\right],$$

where $F$ is the cdf of $\delta\phi|\Phi$ and $G$ is the Gaussian cdf. A standard Gaussian homogeneous control increment is readily obtained by the normalization of $f(\delta\phi|\Phi)$ with the corresponding bias and standard deviation, conditioned on $\Phi$ (Hólm 2007). This Gaussian anamorphosis applied to different humidity datasets, sometimes called *Hólm transform* in this context, has been shown to have a significant impact in the medium-range ECMWF weather forecasts (Andersson et al. 2007).
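The factor one-half relating the variance of the background errors to that of the forecast differences can be made explicit with a short worked equation. It assumes that a forecast difference is the difference of two background errors, written here $\epsilon_b^{(1)}$ and $\epsilon_b^{(2)}$ (a notation introduced only for this illustration), that are independent and identically distributed conditionally on $\Phi$:

$$\mathrm{var}(\delta\phi) = \mathrm{var}\left(\epsilon_b^{(1)} - \epsilon_b^{(2)}\right) = \mathrm{var}\left(\epsilon_b^{(1)}\right) + \mathrm{var}\left(\epsilon_b^{(2)}\right) - 2\,\mathrm{cov}\left(\epsilon_b^{(1)}, \epsilon_b^{(2)}\right) = 2\,\mathrm{var}(\epsilon_b).$$

If the two errors were correlated, the covariance term would not vanish and the factor would differ from one-half.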

#### 4) Gaussian analyses under linear inequality constraints

Some additional constraints may render a Gaussian data assimilation scheme non-Gaussian. This may happen when the prior forces the control variables to lie in a polytope (to satisfy linear inequalities), or when an observation error prior must account for outliers (as in section 2). If the unconstrained priors are Gaussian, then the constrained priors are truncated Gaussian priors. Remarkably, several Gaussian data assimilation schemes can be extended to the truncated Gaussian case in a mathematically rigorous way, with limited complications.

In variational data assimilation, Lagrangian duality (Borwein and Lewis 2000) can be used to lift these constraints, either on the observational errors (as made explicit in section 2), or in the state background errors (Bocquet 2008), through the use of Lagrange multipliers. The transformation is essentially exact if the cost functions are convex, a requirement that may not be satisfied if the models are nonlinear.

In ensemble-based Kalman filtering, filters can be extended to deal with linear inequalities. Assume the background is a truncated Gaussian, whereas the observation errors are normal. Then the analysis as seen from the Bayes’s formula yields the product of a truncated Gaussian by a Gaussian, which is in turn a truncated Gaussian. Besides, the analysis uses the same set of operators (such as the Kalman gain) as in the unconstrained case. This makes the use of such a scheme very practical. The major change comes from the need to sample from a truncated Gaussian, which is not straightforward for high-dimensional problems. The truncated Kalman filter was developed by Lauvernet et al. (2009) in a geophysical context and successfully tested on a one-dimensional mixed layer ocean model.
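For a scalar state constrained by a lower bound, the argument above can be sketched as follows. This is not the scheme of Lauvernet et al. (2009): the function `truncated_gaussian_analysis` is hypothetical, the background parameters are those of the untruncated parent Gaussian, and the rejection sampler is an assumption that is only practical in very low dimension.

```python
import random


def truncated_gaussian_analysis(xb, var_b, y, var_o, lower, n_samples, rng):
    """Analysis for a Gaussian background truncated to x >= lower and a
    Gaussian observation error, with a directly observed scalar state."""
    gain = var_b / (var_b + var_o)          # same gain as in the unconstrained case
    xa = xb + gain * (y - xb)               # mean of the parent Gaussian of the analysis
    var_a = (1.0 - gain) * var_b
    samples = []
    while len(samples) < n_samples:         # rejection sampling of the truncated law
        x = rng.gauss(xa, var_a ** 0.5)
        if x >= lower:
            samples.append(x)
    return xa, var_a, samples


rng = random.Random(0)
xa, var_a, samples = truncated_gaussian_analysis(
    xb=0.5, var_b=1.0, y=-0.2, var_o=0.5, lower=0.0, n_samples=2000, rng=rng)
sample_mean = sum(samples) / len(samples)
```

The Kalman gain is unchanged, but the mean of the truncated analysis pdf, estimated by `sample_mean`, is larger than the parent mean `xa` because the negative part of the parent Gaussian is excluded.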

### c. Using non-Gaussian deviations in the priors to improve analysis

Given the sampling statistics of innovations, it is possible to compute the mean and covariance matrix of the innovation vector. This is useful to correct error biases and for tuning the prescribed error covariance matrices (Desroziers et al. 2005). Beyond those statistics, innovation histograms and higher-order moments of the innovations can also be computed, for instance some measures of non-Gaussianity like the skewness $s_d$ and kurtosis $k_d$. Pires et al. (2010, manuscript submitted to *Physica D*) have computed diagnostics of $s_d$, $k_d$ for the quality-controlled ECMWF innovations of brightness temperatures of a set of HIRS channels. Their results emphasized the statistically significant non-Gaussianity of the errors in several channels. They estimate a joint non-Gaussian prior pdf for the observation errors $\epsilon_o$ and background errors $\epsilon_b$ in the observation space, using the maximum entropy on the mean (MEM) method. The method follows the same principle as the one used and exemplified in sections 6a and 6e. The output of the method is a pdf of $\epsilon_o$ and $\epsilon_b$, compatible with the prescribed innovation statistics. Moreover, it is minimally committed in the sense that, from the information theory point of view, it is the simplest pdf (with minimal extra information) that explains the prescribed statistics. This prior modeling can be shown to be beneficial to the subsequent analyses that go beyond the BLUE result.

### d. Particle filtering with Gaussian filters as importance proposals

We come back to the ideas of the particle filter. We will see how Gaussian analyses can help to numerically solve the Bayesian estimation problem. The ideas are fairly recent in the geophysical community and the extrapolation to complex systems is speculative.

#### 1) Concept of importance sampling

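The pdf of the trajectory $\mathbf{X}_k = \{\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k\}$ conditional on the collection of observations $\mathbf{Y}_k = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k\}$ is considered, up to time $t_k$. For any time index $k = 1, \ldots, K$:

$$p_k(\mathbf{X}_k|\mathbf{Y}_k) \simeq \sum_{i=1}^{M} \omega_k^i\, \delta(\mathbf{X}_k - \mathbf{X}_k^i).$$

This representation is a combination of $M$ particles’ trajectories $\mathbf{X}_k^i = \{\mathbf{x}_0^i, \mathbf{x}_1^i, \mathbf{x}_2^i, \ldots, \mathbf{x}_k^i\}$ and weights $\omega_k^i$ attached to each one of them. One has the freedom to draw the particle trajectories from a known *proposal* pdf $q_k$. These particles can have any distribution, provided the support of $q_k$ includes that of $p_k$. In order for the particle filter to still solve the Bayesian problem, the weights need to be corrected so that the discrete pdf is still representative of the system pdf:

$$\omega_k^i \propto \frac{p_k(\mathbf{X}_k^i|\mathbf{Y}_k)}{q_k(\mathbf{X}_k^i|\mathbf{Y}_k)}.$$

In the Monte Carlo methods literature, this is known as importance sampling (Doucet et al. 2001).

Using Bayes’s rule and the Markov property of the model, one can decompose $p_k(\mathbf{X}_k|\mathbf{Y}_k)$ according to

$$p_k(\mathbf{X}_k|\mathbf{Y}_k) \propto p_k(\mathbf{y}_k|\mathbf{x}_k)\, p_k(\mathbf{x}_k|\mathbf{x}_{k-1})\, p_{k-1}(\mathbf{X}_{k-1}|\mathbf{Y}_{k-1}).$$

If, in addition, one assumes that the proposal distribution is obtained by filtering (not smoothing, i.e., the state pdf depends only on current and past observations), then one relates the importance proposal at successive times according to

$$q_k(\mathbf{X}_k|\mathbf{Y}_k) = q_k(\mathbf{x}_k|\mathbf{X}_{k-1}, \mathbf{Y}_k)\, q_{k-1}(\mathbf{X}_{k-1}|\mathbf{Y}_{k-1}).$$

Thus, in a sequential context, the recursion formula for the weights reads as follows:

$$\omega_k^i \propto \frac{p_k(\mathbf{y}_k|\mathbf{x}_k^i)\, p_k(\mathbf{x}_k^i|\mathbf{x}_{k-1}^i)}{q_k(\mathbf{x}_k^i|\mathbf{X}_{k-1}^i, \mathbf{Y}_k)}\, \omega_{k-1}^i.$$

The importance proposal pdf of the bootstrap filter is the transition operator of the model: if the proposal is $q_k(\mathbf{x}_k|\mathbf{X}_{k-1}, \mathbf{Y}_k) \equiv p_k(\mathbf{x}_k|\mathbf{x}_{k-1})$ then the bootstrap filter is recovered, since

$$\omega_k^i \propto p_k(\mathbf{y}_k|\mathbf{x}_k^i)\, \omega_{k-1}^i.$$

By contrast, the choice $q_k \equiv p_k(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{y}_k)$ minimizes the variance of the weights. It is called the optimal importance function (Doucet et al. 2000).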
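The weight recursion of sequential importance sampling can be sketched as follows for the bootstrap choice of proposal, in which the transition density cancels and only the likelihood remains. This is a toy under stated assumptions: additive Gaussian model and observation errors with isotropic variances, an identity observation operator, and a linear toy model; the names `sis_step` and `effective_sample_size` are hypothetical.

```python
import numpy as np

def log_gauss(res, var):
    """Log of an isotropic Gaussian density, up to an additive constant."""
    return -0.5 * np.sum(res**2, axis=-1) / var

def sis_step(particles, logw, y, model, q_var, r_var, rng):
    """One bootstrap step: draw from p(x_k | x_{k-1}), reweight by p(y_k | x_k)."""
    noise = rng.normal(scale=np.sqrt(q_var), size=particles.shape)
    forecast = model(particles) + noise
    logw = logw + log_gauss(y - forecast, r_var)   # observation operator = identity
    logw = logw - np.max(logw)                     # stabilize before exponentiating
    logw = logw - np.log(np.sum(np.exp(logw)))     # normalize the weights
    return forecast, logw, np.exp(logw)

def effective_sample_size(w):
    """Between 1 (collapsed filter) and M (uniform weights)."""
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(2)
M, N = 200, 3
particles = rng.normal(size=(M, N))
logw = np.full(M, -np.log(M))                      # uniform initial weights
model = lambda x: 0.9 * x                          # hypothetical toy dynamics
y = np.array([0.5, -0.2, 0.1])

particles, logw, w = sis_step(particles, logw, y, model, q_var=0.1, r_var=0.5, rng=rng)
ess = effective_sample_size(w)
```

The effective sample size is a common diagnostic of weight degeneracy: it drops toward 1 when a few particles carry almost all of the weight.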

In our opinion, the same importance sampling principle can be used to justify two attempts to improve particle filtering from the geophysicist’s point of view. Xiong et al. (2006) make use of a Gaussian resampling, based on the particles’ first- and second-order moments. It was shown to improve the forecast ability of the particle filter, often beating the EnKF, in the case of the Lorenz-63 model. The merging particle filter of Nakano et al. (2007) is, in a similar vein, a Gaussian resampling (matching of first- and second-order moments) and is used to enrich the sampling of the particle filter. The authors demonstrate a significant improvement with the Lorenz-63 and Lorenz-95 models, but the particle filter still requires too many particles as compared to the EnKF, even on these toy models. Although it is not stated in those words, these papers illustrate the use of Gaussian hypotheses in rigorously non-Gaussian estimation, through the use of importance sampling. However, to our knowledge, the corrections to the weights of the particle filters that are necessary to guarantee the proper Bayesian asymptotics were not computed.

Now we come back to the problem of the collapse of the particle filter. The basic idea that was fostered in the applied mathematics community, and advocated by van Leeuwen (2009) in geophysics, is that in order to avoid too unlikely trajectories, particles should be drawn at time $t_{k-1}$ from a proposal making use of $\mathbf{y}_k$. For high-dimensional applications and complex models, this is certainly not trivial to implement. The following section gives clues, and original numerical examples based on this idea.

#### 2) Current-observation-dependent proposal with Gaussian analyses

The optimal importance function is $q_k \equiv p_k(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{y}_k)$. Its implementation is only practical (and without approximation) when the observation operator is linear (Doucet et al. 2000). Let us use a data assimilation system of the form in Eq. (2), with $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_k)$ and $\boldsymbol{\upsilon}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_k)$, and a linear observation operator $\mathbf{H}_k$. Even though the local error assumptions are Gaussian, the filter is meant to account for any nonlinear dynamical model. For each particle, a simple BLUE analysis is carried out which strikes the balance between the uncertainty $\mathbf{R}_k$ of the observation $\mathbf{y}_k$ and the uncertainty of the model $\mathbf{Q}_k$ at time $t_k$. Therefore, the particles are propagated according to the pdf:

$$q_k(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{y}_k) = n(\mathbf{x}_k - \mathbf{x}_k^a, \boldsymbol{\Sigma}_k),$$

with

$$\boldsymbol{\Sigma}_k^{-1} = \mathbf{Q}_k^{-1} + \mathbf{H}_k^{\mathrm{T}} \mathbf{R}_k^{-1} \mathbf{H}_k, \qquad \mathbf{x}_k^a = M_k(\mathbf{x}_{k-1}) + \mathbf{K}_k\left[\mathbf{y}_k - \mathbf{H}_k M_k(\mathbf{x}_{k-1})\right],$$

and $\mathbf{K}_k = \boldsymbol{\Sigma}_k \mathbf{H}_k^{\mathrm{T}} \mathbf{R}_k^{-1}$. Here $n(\mathbf{x}_k - \mathbf{x}_k^a, \boldsymbol{\Sigma}_k) \propto \exp\left[-\tfrac{1}{2}(\mathbf{x}_k - \mathbf{x}_k^a)^{\mathrm{T}} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_k - \mathbf{x}_k^a)\right]$ is the pdf of $\mathcal{N}(\mathbf{x}_k^a, \boldsymbol{\Sigma}_k)$, the multivariate Gaussian distribution with mean $\mathbf{x}_k^a$ and covariance matrix $\boldsymbol{\Sigma}_k$. The updating recursive law on the weights is simply

$$\omega_k^i \propto p_k(\mathbf{y}_k|\mathbf{x}_{k-1}^i)\, \omega_{k-1}^i$$

and, on the previous assumptions, can be computed explicitly since it is Gaussian:

$$p_k(\mathbf{y}_k|\mathbf{x}_{k-1}) = n\left(\mathbf{y}_k - \mathbf{H}_k M_k(\mathbf{x}_{k-1}),\ \mathbf{R}_k + \mathbf{H}_k \mathbf{Q}_k \mathbf{H}_k^{\mathrm{T}}\right).$$

Contrary to the bootstrap filter, only particles with a reasonable likelihood given the observations will be sampled. We have tested this optimal importance sampling particle filter (OISPF) on the Lorenz-95 model, using the exact same setup as in section 2, but with the original $N = 40$ system. Performance tests that compare the merits of the EnKF, the bootstrap filter, and the OISPF are reported in Fig. 5. The OISPF improvement is spectacular for moderate ensemble size, as compared to the bootstrap filter, but insufficient beyond.
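One step of a particle filter using the optimal importance function with a linear observation operator can be sketched as below. This is not the experimental setup of the paper: the four-variable toy model, the matrices, and the function name `oispf_step` are assumptions made for illustration, and no resampling is included. The proposal covariance, gain, and weight update follow the standard Gaussian formulas for additive model error with covariance `Q` and observation error with covariance `R`.

```python
import numpy as np

def oispf_step(particles, logw, y, model, H, Q, R, rng):
    """Propagate particles with the optimal proposal and update the log-weights."""
    M, N = particles.shape
    Ri = np.linalg.inv(R)
    Sigma = np.linalg.inv(np.linalg.inv(Q) + H.T @ Ri @ H)   # proposal covariance
    K = Sigma @ H.T @ Ri                                      # per-particle BLUE gain
    Si = np.linalg.inv(H @ Q @ H.T + R)       # inverse covariance of y_k given x_{k-1}

    xf = model(particles)                     # deterministic forecast of each particle
    innov = y - xf @ H.T                      # innovations, shape (M, d)

    # Weight update with p(y_k | x_{k-1}^i): it does not depend on the sampled x_k^i.
    logw = logw - 0.5 * np.einsum("id,de,ie->i", innov, Si, innov)
    logw = logw - np.max(logw)
    logw = logw - np.log(np.sum(np.exp(logw)))

    # Draw x_k^i from N(x_k^{a,i}, Sigma).
    xa = xf + innov @ K.T
    L = np.linalg.cholesky(Sigma)
    return xa + rng.normal(size=(M, N)) @ L.T, logw

rng = np.random.default_rng(4)
M, N = 100, 4
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])          # two of the four variables are observed
Q = 0.1 * np.eye(N)
R = 0.2 * np.eye(2)
model = lambda x: x + 0.05 * np.sin(x)        # hypothetical weakly nonlinear model
y = np.array([0.3, -0.1])

particles = rng.normal(size=(M, N))
logw = np.full(M, -np.log(M))
particles, logw = oispf_step(particles, logw, y, model, H, Q, R, rng)
```

Because the weights depend only on the previous positions, the random draw at time $t_k$ adds no extra variance to them, which is the sense in which this proposal is optimal.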

More generally, let $\mathbf{x}_k^a$ and $\mathbf{P}_k^a$ be the analyzed state and analysis error covariance matrix, respectively, of the Gaussian filter: either extended Kalman filter, unscented Kalman filter, ensemble Kalman filter, or ensemble transform Kalman filter, etc. The proposal would then be of the following form:

$$q_k(\mathbf{x}_k|\mathbf{X}_{k-1}, \mathbf{Y}_k) = n(\mathbf{x}_k - \mathbf{x}_k^a, \mathbf{P}_k^a).$$

This leads to other variants of particle filter with a guiding BLUE-based proposal.

In the weighted ensemble Kalman filter (WEnKF), the forecast error covariance matrix $\mathbf{P}_k^f$ is estimated from the particles, as in any ensemble-based method. The proposal of Eq. (49) is

$$q_k(\mathbf{x}_k|\mathbf{x}_{k-1}, \mathbf{y}_k) = n(\mathbf{x}_k - \mathbf{x}_k^a, \boldsymbol{\Sigma}_k),$$

but with $\mathbf{K}_k$ now given by Eq. (22), and $\boldsymbol{\Sigma}_k$ given by

$$\boldsymbol{\Sigma}_k = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k)\, \mathbf{Q}_k\, (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k)^{\mathrm{T}} + \mathbf{K}_k \mathbf{R}_k \mathbf{K}_k^{\mathrm{T}}.$$

Contrary to the OISPF, there is an implicit approximation in the use of $\mathbf{P}_k^f$ in $\mathbf{K}_k$, as it uses the positions of all particles. Making use of this analysis, Papadakis (2007) proposed to use an ensemble-based Kalman filter as a *channeling* Gaussian filter. That way, he can define a (tentatively) true particle filter (meaning with a proper Bayesian asymptotic limit) with weights attached to each member of the ensemble. After the analysis, the weights are updated thanks to Eq. (55), and the ensemble is possibly resampled. The forecast that follows the analysis is the typical step of an ensemble Kalman filter, using the particles of the filter as ensemble members. The only difference with the EnKF lies in the weights that must be taken into account in the estimation of the mean and error covariance matrix […].

Two words of caution are in order about the WEnKF. First, the particles are interacting (following the terminology of Del Moral 2004), not only through standard resampling, but also through the estimation of the covariance matrix. Second, as a Gaussian pdf, the proposal function has rapidly vanishing tails. Therefore, on the one hand, the WEnKF should be very efficient, as compared to other particle filters, in the regime where the EnKF outperforms particle filters. On the other hand, it may be weaker in a regime where simpler particle filters outperform the EnKF. Thus, we believe that the overall interest in the WEnKF is debatable, even though it is very appealing.
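The way particle weights can enter ensemble statistics is sketched below. The estimator shown is one common choice and is an assumption here, not necessarily the one used in the WEnKF: the covariance is normalized by $1 - \sum_i \omega_i^2$, which reduces to the usual $1/(M-1)$ factor for uniform weights. The function name `weighted_moments` is hypothetical.

```python
import numpy as np

def weighted_moments(ensemble, w):
    """Weighted mean and covariance of an ensemble of shape (M, N)."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                                   # enforce normalized weights
    mean = w @ ensemble
    anomalies = ensemble - mean
    # Assumed normalization: equals 1/(M-1) when all weights are 1/M.
    cov = (anomalies * w[:, None]).T @ anomalies / (1.0 - np.sum(w**2))
    return mean, cov

rng = np.random.default_rng(5)
ensemble = rng.normal(size=(50, 3))
uniform = np.full(50, 1.0 / 50)
mean_u, cov_u = weighted_moments(ensemble, uniform)

# Non-uniform weights shift the statistics toward the heavily weighted members.
w = rng.uniform(size=50)
mean_w, cov_w = weighted_moments(ensemble, w)
```

With uniform weights the function reproduces the standard EnKF sample mean and covariance, so the weighted filter degrades gracefully to the unweighted one.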

### e. Nonperturbative non-Gaussian methods for high-dimensional linear models

There are relevant geophysical cases where the models are approximately linear but the priors are intrinsically different from Gaussians. This is the case for tracer transport, radionuclide dispersion, dust, several greenhouse gases including CO_{2}, etc. In that context (i.e., linear models and non-Gaussian priors), and unlike previous examples, a non-Gaussian analysis can be performed thoroughly without approximation, using nonlinear convex analysis (see Borwein and Lewis 2000). The theory is based on nonquadratic cost functions that generalize 4D-Var and the Physical-space Statistical Analysis System (PSAS) in the specific linear model case (Bocquet 2005b,c, 2007; Krysta and Bocquet 2007; Bocquet 2008).

Consider the estimation of a control vector $\mathbf{x} \in \mathbb{R}^N$ with model/observation error $\boldsymbol{\upsilon} \in \mathbb{R}^d$:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \boldsymbol{\upsilon}, \tag{58}$$

where $\mathbf{H} \in \mathbb{R}^{d \times N}$ is the Jacobian that combines the observation operator and the model linking the observations to the forcing field or initial condition. The joint prior pdf in control space and in observation error space is $\nu(\mathbf{x}, \boldsymbol{\upsilon})$. Finding the posterior pdf $p(\mathbf{x}, \boldsymbol{\upsilon})$ is the goal of the analysis in this framework. However, the object $p(\mathbf{x}, \boldsymbol{\upsilon})$ might be too complex and difficult to interpret. That is why an estimator must be chosen that extracts some precise information from the full pdf: the mode, also known as maximum a posteriori (MAP), the mean value estimator, etc.

#### 1) Bayesian inference and maximum a posteriori

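The MAP estimator maximizes the posterior pdf obtained from Bayes’s rule. Defining $\zeta = -\ln \nu$, this amounts to minimizing the cost function

$$\mathcal{J}(\mathbf{x}) = \zeta(\mathbf{x}, \mathbf{y} - \mathbf{H}\mathbf{x})$$

with respect to $\mathbf{x}$. In a Gaussian context, this would give back the usual 4D-Var cost function. The dual cost function

$$\widehat{\mathcal{J}}(\boldsymbol{\lambda}) = \zeta^*(\mathbf{H}^{\mathrm{T}}\boldsymbol{\lambda}, \boldsymbol{\lambda}) - \boldsymbol{\lambda}^{\mathrm{T}}\mathbf{y} \tag{60}$$

could be equivalently solved. The $\boldsymbol{\lambda} \in \mathbb{R}^d$ are the Lagrange multipliers that enforce Eq. (58). Here $\zeta^*$ is the Legendre–Fenchel conjugate of $\zeta$, and is defined by

$$\zeta^*(\boldsymbol{\lambda}_{\mathbf{x}}, \boldsymbol{\lambda}_{\boldsymbol{\upsilon}}) = \sup_{\mathbf{x}, \boldsymbol{\upsilon}} \left[\boldsymbol{\lambda}_{\mathbf{x}}^{\mathrm{T}}\mathbf{x} + \boldsymbol{\lambda}_{\boldsymbol{\upsilon}}^{\mathrm{T}}\boldsymbol{\upsilon} - \zeta(\mathbf{x}, \boldsymbol{\upsilon})\right]. \tag{61}$$

If $\nu$ is Gaussian, one obtains the PSAS formalism (Courtier 1997; Cohn et al. 1998).

However, such a dual equivalence is possible only if $\zeta$ is convex.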
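The role of convexity in this kind of duality can be checked numerically in one dimension. The sketch below approximates the Legendre–Fenchel conjugate on a grid; the grids, the variance `b`, and the double-well function are hypothetical choices for illustration. For a quadratic (Gaussian) function the conjugate is again quadratic, whereas for a nonconvex function the biconjugate falls below the original function, which is the signature of a duality gap.

```python
import numpy as np

def conjugate(f_vals, grid, slopes):
    """Grid approximation of f*(s) = sup_x [s * x - f(x)] for each slope s."""
    return np.max(slopes[:, None] * grid[None, :] - f_vals[None, :], axis=1)

x = np.linspace(-10.0, 10.0, 4001)
lam = np.linspace(-2.0, 2.0, 81)

# Gaussian case: zeta(x) = x^2 / (2 b) has conjugate b * lam^2 / 2.
b = 2.0
zeta_gauss = 0.5 * x**2 / b
star_gauss = conjugate(zeta_gauss, x, lam)

# Nonconvex case: a double well with minima at x = -1 and x = +1.
zeta_well = (x**2 - 1.0)**2
star_well = conjugate(zeta_well, x, lam)
biconj_at_0 = conjugate(star_well, lam, np.array([0.0]))[0]
gap_at_0 = 1.0 - biconj_at_0          # zeta_well(0) = 1, convex envelope at 0 is 0
```

In the Gaussian case the primal and dual problems carry the same information, which is why 4D-Var and PSAS are interchangeable; in the double-well case the dual only sees the convex envelope.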

#### 2) Maximum entropy on the mean inference

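The MEM inference relies on the relative entropy of the prior pdf $\nu$ and of the posterior pdf $p$. The optimal $p$ is the one that minimizes the gain of information (maximizes the entropy), except for what is gained from the observations. Since this gain of information is objectively measured by the Kullback–Leibler divergence $\mathcal{K}(p, \nu) = \mathrm{E}_p[\ln(p/\nu)]$, the related cost function (often called level 2 primal cost function) is

$$\mathcal{L}(p, \boldsymbol{\lambda}) = \mathcal{K}(p, \nu) + \boldsymbol{\lambda}^{\mathrm{T}}\left(\mathbf{y} - \mathrm{E}_p[\mathbf{H}\mathbf{x} + \boldsymbol{\upsilon}]\right), \tag{62}$$

where $\mathrm{E}_p[\cdot] = \sum_{\mathbf{x}, \boldsymbol{\upsilon}} p(\mathbf{x}, \boldsymbol{\upsilon})\, \cdot$ is the expectation operator, and $\sum_{\mathbf{x}, \boldsymbol{\upsilon}}$ is a symbol for a sum (discrete variables) or an integral (continuous variables). The constraint that $p$ should satisfy the observation on the mean is enforced through Lagrange multipliers $\boldsymbol{\lambda} \in \mathbb{R}^d$. The direct optimization of $p$ leads to the intermediary:

$$p_{\boldsymbol{\lambda}}(\mathbf{x}, \boldsymbol{\upsilon}) \propto \nu(\mathbf{x}, \boldsymbol{\upsilon}) \exp\left(\boldsymbol{\lambda}^{\mathrm{T}}\mathbf{H}\mathbf{x} + \boldsymbol{\lambda}^{\mathrm{T}}\boldsymbol{\upsilon}\right), \tag{63}$$

up to the normalization constant of the pdf $p_{\boldsymbol{\lambda}}$. By inserting $p_{\boldsymbol{\lambda}}$ into Eq. (62), one obtains the dual cost function [similarly to Eq. (60)], which depends on $\boldsymbol{\lambda}$:

$$\widehat{\mathcal{J}}(\boldsymbol{\lambda}) = \hat{\nu}(\mathbf{H}^{\mathrm{T}}\boldsymbol{\lambda}, \boldsymbol{\lambda}) - \boldsymbol{\lambda}^{\mathrm{T}}\mathbf{y}, \tag{64}$$

where $\hat{\nu}$ is the log-Laplace transform of $\nu$, and is defined by

$$\hat{\nu}(\boldsymbol{\lambda}_{\mathbf{x}}, \boldsymbol{\lambda}_{\boldsymbol{\upsilon}}) = \ln \sum_{\mathbf{x}, \boldsymbol{\upsilon}} \nu(\mathbf{x}, \boldsymbol{\upsilon}) \exp\left(\boldsymbol{\lambda}_{\mathbf{x}}^{\mathrm{T}}\mathbf{x} + \boldsymbol{\lambda}_{\boldsymbol{\upsilon}}^{\mathrm{T}}\boldsymbol{\upsilon}\right). \tag{65}$$

By evaluating the pdf in Eq. (63) at the minimum $\bar{\boldsymbol{\lambda}}$ of Eq. (64), one obtains the posterior pdf $p_{\bar{\boldsymbol{\lambda}}}(\mathbf{x}, \boldsymbol{\upsilon})$ from which the most sensible estimators that can be derived are the state mean $\bar{\mathbf{x}}$ and the error mean $\bar{\boldsymbol{\upsilon}}$. These means can equivalently be obtained by minimizing the *level-1* primal cost function:

$$\mathcal{J}(\mathbf{x}) = \hat{\nu}^*(\mathbf{x}, \mathbf{y} - \mathbf{H}\mathbf{x}). \tag{66}$$

Contrary to the strictly Bayesian and MAP inference, the cost functions in Eqs. (62), (64), and (66) are always convex by construction because of the convexity of the Kullback–Leibler functional on a vector subspace, which was mentioned in section 4. Therefore, these cost functions always have a single minimum, and the primal and dual cost functions are always equivalent (there is no duality gap between their optima). If $\nu$ is Gaussian, 4D-Var and PSAS cost functions for linear assumptions are recovered. The schematic of the different transformations and equivalences is displayed in Fig. 6. Note that the primal part of the formulation can be generalized to nonlinear models, but the duality correspondence is not valid any more.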
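A small numerical sketch of a MEM analysis in observation space is given below. It assumes a toy setup that is not taken from the applications cited here: each source component follows an independent Bernoulli prior (value `m` with probability `gamma`, zero otherwise), the errors are Gaussian with covariance `R`, and the dual cost function is taken in the form $\hat{\nu}(\mathbf{H}^{\mathrm{T}}\boldsymbol{\lambda}, \boldsymbol{\lambda}) - \boldsymbol{\lambda}^{\mathrm{T}}\mathbf{y}$. The function name `mem_bernoulli` and the damped Newton solver are illustrative choices.

```python
import numpy as np

def mem_bernoulli(H, y, R, gamma, m, n_iter=100):
    """Minimize the dual MEM cost and return the mean state, mean error, multipliers."""
    lam = np.zeros(len(y))

    def dual(l):
        mu = H.T @ l
        # log-Laplace transform of the Bernoulli prior, computed stably
        prior = np.sum(np.logaddexp(np.log1p(-gamma), np.log(gamma) + m * mu))
        return prior + 0.5 * l @ R @ l - l @ y

    def active_prob(l):
        # probability that x_i = m under the tilted pdf p_lambda
        return 1.0 / (1.0 + (1.0 - gamma) / gamma * np.exp(-m * (H.T @ l)))

    for _ in range(n_iter):
        s = active_prob(lam)
        grad = H @ (m * s) + R @ lam - y          # mismatch of the constraint on the mean
        if np.linalg.norm(grad) < 1e-10:
            break
        hess = H @ np.diag(m**2 * s * (1.0 - s)) @ H.T + R
        step = np.linalg.solve(hess, grad)
        t = 1.0                                   # backtracking (Armijo) line search
        while t > 1e-12 and dual(lam - t * step) > dual(lam) - 1e-4 * t * (grad @ step):
            t *= 0.5
        lam = lam - t * step
    return m * active_prob(lam), R @ lam, lam

rng = np.random.default_rng(3)
N, d, m, gamma = 20, 5, 1.0, 0.2                  # more unknowns than observations
H = rng.uniform(0.0, 1.0, size=(d, N))
x_true = m * (rng.uniform(size=N) < gamma)
R = 0.01 * np.eye(d)
y = H @ x_true + rng.normal(scale=0.1, size=d)

xbar, upsbar, lam = mem_bernoulli(H, y, R, gamma, m)
```

The optimization runs over only `d` multipliers, and the retrieved mean state automatically stays within the bounds of the prior, which a Gaussian analysis would not guarantee.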

An application to the forecast of an accidental plume of pollutant is given in the context of the European tracer experiment (Nodop et al. 1998) in Fig. 7. About 10^{3} observations are used for 2 × 10^{4} control variables. The analyses are hence conveniently carried out in the observation space. The plume contours obtained by the MEM method are much finer than with 4D-Var, which is of utmost importance for such a dispersion event, especially in the regions where the concentration field exhibits strong gradients.

The strictly Bayesian solution of the previous section is different from the MEM solution: the exponential pdf in Eq. (63) is generally not the posterior pdf obtained from Bayes’s rule. Instead, the MEM method convexifies the objective function that would be obtained from Bayes’s rule. Therefore, if the existence of multiple minima matters in the problem, the MEM approach may differ significantly from the strictly Bayesian solution. However, the analysis of multiple minima in geophysical applications is rather speculative. Multiple minima would likely appear when considering a strongly constrained 4D-Var with a sufficiently long assimilation window. Rather than facing such a multiple minima optimization problem, it is tempting to convexify the objective function anyway by, for instance, incorporating model error (weakly constrained 4D-Var). The problem with convexification is the arbitrariness of the regularization of penalty functions, which can result in an unlikely solution. However, for pollutant source reconstruction problems, Bocquet (2008) has shown that the difference between the two approaches is small.

So far, this discussion was relevant for the state estimation problem. For second-order and higher-order moments, the MEM method would not give the correct estimates even in the Gaussian prior case. To circumvent this drawback of the MEM method, Bocquet (2008) showed that correctly defined moments of the MEM inference with a prior *ν* could be obtained from the strictly Bayesian inference but using the prior exp [−(*ν̂*)*]. Within this extension of the MEM method, a precise correspondence is defined between the two approaches.

#### 3) Second-order sensitivity analysis

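[…] $\mathcal{J}_{\mathbf{x}}$ identifies with the part of the cost function attached to the background departure. If the prior pdf splits according to $\nu(\mathbf{x}, \boldsymbol{\upsilon}) = \nu_{\mathbf{x}}(\mathbf{x})\, \nu_{\boldsymbol{\upsilon}}(\boldsymbol{\upsilon})$, the total cost function splits according to

$$\mathcal{J}_{\mathbf{x}, \boldsymbol{\upsilon}} = \mathcal{J}_{\mathbf{x}} + \mathcal{J}_{\boldsymbol{\upsilon}}.$$

Here $\mathcal{J}_{\mathbf{x}}$ and $\mathcal{J}_{\boldsymbol{\upsilon}}$ can be either analytically or numerically computed, following for instance in the case of $\mathcal{J}_{\mathbf{x}}$: […]

Note that in the Gaussian case, one recovers the natural splitting of cost functions into signal and noise degrees of freedom (Rodgers 2000). In this non-Gaussian context, each one of the sensitivities $\partial_{\mathbf{y}}\mathcal{J}_{\mathbf{x}}$ is interpreted as the marginal gain in information induced by the variation of the measurements.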

This non-Gaussian second-order analysis is illustrated with an original experiment on the inversion of the Chernobyl accident radionuclides source term. The physics is essentially linear so that the methodology applies without approximation. The positivity of the source term requires a non-Gaussian background error prior modeling. Figure 8 illustrates several sensitivities in both Gaussian and non-Gaussian analysis cases. Within the global marginal gain of information $\partial_{\mathbf{y}}\mathcal{J}_{\mathbf{x}, \boldsymbol{\upsilon}} = \boldsymbol{\lambda}$, the part $\partial_{\mathbf{y}}\mathcal{J}_{\mathbf{x}}$ that goes into the reconstruction of the source (the signal part) significantly depends on the statistical modeling of the prior. For instance, the Scandinavian observations are relatively less significant in the non-Gaussian case than in the Gaussian case. Put differently, the information content of remote observations is stronger in the non-Gaussian case. This may be due to the positivity constraint that rules out the sources with both negative and positive rates that are more compatible with these remote observations.

#### 4) Score

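In the Gaussian case, with priors $\mathbf{x} \sim \mathcal{N}(\mathbf{x}^b, \mathbf{B})$ and $\boldsymbol{\upsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, the reduction of uncertainty achieved by the analysis can be measured by the score

$$\rho = \frac{\|\bar{\mathbf{x}} - \mathbf{x}^b\|^2_{\mathbf{B}^{-1}} + \|\bar{\boldsymbol{\upsilon}}\|^2_{\mathbf{R}^{-1}}}{\|\mathbf{x}^t - \mathbf{x}^b\|^2_{\mathbf{B}^{-1}} + \|\boldsymbol{\upsilon}^t\|^2_{\mathbf{R}^{-1}}} = 1 - \frac{\|\mathbf{x}^t - \bar{\mathbf{x}}\|^2_{\mathbf{B}^{-1}} + \|\boldsymbol{\upsilon}^t - \bar{\boldsymbol{\upsilon}}\|^2_{\mathbf{R}^{-1}}}{\|\mathbf{x}^t - \mathbf{x}^b\|^2_{\mathbf{B}^{-1}} + \|\boldsymbol{\upsilon}^t\|^2_{\mathbf{R}^{-1}}}, \tag{71}$$

where $\mathbf{x}^t$ and $\boldsymbol{\upsilon}^t$ are the true state and error vector, respectively. The norm $\|\cdot\|_{\mathbf{A}}$, with $\mathbf{A}$ a positive definite matrix, is defined by $\|\mathbf{x}\|^2_{\mathbf{A}} = \mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}$. The transformation from the second to the third member is due to the Pythagorean theorem. The analysis $(\bar{\mathbf{x}}, \bar{\boldsymbol{\upsilon}}) \in \mathbb{R}^N \oplus \mathbb{R}^d$ is the orthogonal projection of $(\mathbf{x}^b, \mathbf{0})$, with respect to the scalar product defined by $\mathbf{B}^{-1} \oplus \mathbf{R}^{-1}$, on the hyperplane of couples $(\mathbf{x}, \boldsymbol{\upsilon})$ such that $\mathbf{y} = \mathbf{H}\mathbf{x} + \boldsymbol{\upsilon}$. This is equivalent to the geometrical interpretation of the analysis in terms of projection using the Mahalanobis norm as a scalar product (Desroziers et al. 2005; Chapnik et al. 2006). This ensures that $0 \le \rho \le 1$, with $\rho = 1$ when the reduction of uncertainty is maximum.

In the non-Gaussian case, this score generalizes to

$$\rho = \frac{\mathcal{K}(p_{\bar{\mathbf{x}}, \bar{\boldsymbol{\upsilon}}}, \nu)}{\mathcal{K}(p_{\mathbf{x}^t, \boldsymbol{\upsilon}^t}, \nu)}, \tag{72}$$

where $p_{\bar{\mathbf{x}}, \bar{\boldsymbol{\upsilon}}}$ and $p_{\mathbf{x}^t, \boldsymbol{\upsilon}^t}$ are the pdfs, belonging to the exponential family Eq. (63), whose state and error averages are $\bar{\mathbf{x}}$ and $\bar{\boldsymbol{\upsilon}}$ in the former case, and $\mathbf{x}^t$ and $\boldsymbol{\upsilon}^t$ in the latter case. The Kullback–Leibler terms, $\mathcal{K}(p_{\bar{\mathbf{x}}, \bar{\boldsymbol{\upsilon}}}, \nu)$ and $\mathcal{K}(p_{\mathbf{x}^t, \boldsymbol{\upsilon}^t}, \nu)$, are expressed here in their abstract (level 2) form, but they can numerically be estimated using their primal form in Eq. (66). In particular, $\mathcal{K}(p_{\bar{\mathbf{x}}, \bar{\boldsymbol{\upsilon}}}, \nu)$ often identifies with the minimum of the objective function, and is just a numerical by-product of a data assimilation variational scheme. In the Gaussian case and with independent background and error priors, Eq. (72) simplifies to Eq. (71). This result was useful in the evaluation of a European radionuclides monitoring network whose observations were assimilated in a set of OSSEs (Krysta and Bocquet 2007).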

## 7. Perspectives

The theoretical and numerical Bayesian solutions to the estimation problem have been shown not only to be appealing but also quite natural. Yet, the computational complexity that prevents their use has ultimately led to the success of 4D-Var and the ensemble Kalman filter. However, non-Gaussianity generated by the nonlinearities of the model or intrinsic to the priors has not vanished. The ever-increasing computing power and widespread parallel architectures, even on cheap systems, make the use of more sophisticated applied mathematical solutions tempting. Nevertheless, the complexity of the estimation from these solutions does not scale reasonably (e.g., linearly) with the high dimensionality and complexity of geophysical systems, and relying on computing power alone is insufficient.

In this review, it has been shown that one can use sophisticated tools to diagnose non-Gaussianity in data assimilation with concepts inherited from statistics and information theory. More importantly, several examples of solutions were given with promising performance. A few were of perturbative nature using an expansion of the Gaussian analysis system (weakly non-Gaussian prior construction). A few were of nonperturbative nature with little approximation (e.g., maximum entropy filter and Gaussian anamorphosis). Others were of fully nonperturbative nature taking advantage of Gaussian analysis guidance, or some linearity in the models (e.g., the optimal importance function particle filter, the weighted ensemble Kalman filter, and maximum entropy variational inference). These examples all remain specific, either because they rely on an assumption difficult to generalize, or because they were only tested on relatively low-dimensional systems so far. However, they can already be used on highly nonlinear subsystems of larger geophysical systems, such as Lagrangian drifters in a flow, or a submanifold of the dynamics. Alternatively, they can be used on real applications that do possess simplifying features, such as model linearity.

Increasing computing power might not only serve advanced data assimilation techniques, but also allow one to process more observations (denser coverage in space and time) and to use finer model resolution. As a consequence, models may become locally more and more linear and more and more Gaussian between analyses. However, this argument has been partly refuted by the nonlinearity of small-scale physics, in conjunction with the fundamentally multiscale nature of geophysical systems. That is why we believe proper handling of non-Gaussianity will remain an important issue.

We think that more general solutions for high-dimensional systems than those exemplified in this review will require the simultaneous reduction of the model’s dynamical degrees of freedom, spatial dividing–localization strategies, and an efficient sampling strategy (in connection with model error characterization). Particle filters or variants that are more advanced will then eventually be useful.

The number of possibilities to build up new solutions is tremendous, especially for filtering. It is to be expected that more and more applications will make use of an increasing number of theories mixing sequential and variational approaches, or combining Gaussian analysis and fully Bayesian ones. We expect that comparisons of all these methodologies and contexts will be a (very) difficult task. Theoretical and general guidance will then be needed to sort them all, with both mathematical analysis and the use of high-dimensional geophysical benchmarking models.

## Acknowledgments

M. Bocquet is grateful to E. Kalnay and L. Fillion, organizers of the WWRP/THORPEX workshop on “4D-VAR and Ensemble Kalman Filter Intercomparisons,” held in Buenos Aires, Argentina, November 2008. The paper follows the overview of the session “Issues of Nonlinearity and non-Gaussianity” presented at this workshop. The authors are indebted to an anonymous reviewer, C. Snyder, and H. L. Mitchell for their substantial and thorough suggestions that helped to improve the manuscript. M. Bocquet acknowledges stimulating discussions with P. J. van Leeuwen and O. Pannekoucke. Finally, the authors thank M. Krysta and L. Delle Monache for their careful reading of the manuscript and for their useful comments.

## REFERENCES

Anderson, J. L., and S. L. Anderson, 1999: A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. *Mon. Wea. Rev.*, **127**, 2741–2758.

Anderson, T. W., and D. A. Darling, 1952: Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. *Ann. Math. Stat.*, **23**, 193–212.

Andersson, E., and H. Järvinen, 1999: Variational quality control. *Quart. J. Roy. Meteor. Soc.*, **125**, 697–722.

Andersson, E., M. Fisher, E. Hólm, L. Isaksen, G. Radnóti, and Y. Trémolet, 2005: Will the 4D-Var approach be defeated by nonlinearity? Tech. Rep. 479, ECMWF, 28 pp. [Available online at http://www.ecmwf.int/publications/library/ecpublications/_pdf/tm/401-500/tm479.pdf].

Andersson, E., and Coauthors, 2007: Analysis and forecast impact of the main humidity observing systems. *Quart. J. Roy. Meteor. Soc.*, **133**, 1473–1485.

Auroux, D., 2007: Generalization of the dual variational data assimilation algorithm to a nonlinear layered quasi-geostrophic ocean model. *Inverse Probl.*, **23**, 2485–2503.

Barndorff-Nielsen, O. E., and D. R. Cox, 1989: *Asymptotic Techniques for Use in Statistics*. *Meteor. Monogr.*, No. 31, Chapman & Hall, 252 pp.

Bellman, R., 1961: *Adaptive Control Processes: A Guided Tour*. Princeton University Press, 255 pp.

Bengtsson, T., C. Snyder, and D. Nychka, 2003: Toward a nonlinear ensemble filter for high-dimensional systems. *J. Geophys. Res.*, **108**, 8775, doi:10.1029/2002JD002900.

Berliner, M. L., and C. K. Wikle, 2007: Approximate importance sampling Monte Carlo for data assimilation. *Physica D*, **230**, 37–49.

Bertino, L., G. Evensen, and H. Wackernagel, 2003: Sequential data assimilation techniques in oceanography. *Int. Stat. Rev.*, **71**, 223–241.

Bishop, C. H., B. J. Etherton, and S. J. Majumdar, 2001: Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. *Mon. Wea. Rev.*, **129**, 420–436.

Bocquet, M., 2005a: Grid resolution dependence in the reconstruction of an atmospheric tracer source. *Nonlinear Processes Geophys.*, **12**, 219–234.

Bocquet, M., 2005b: Reconstruction of an atmospheric tracer source using the principle of maximum entropy. I: Theory. *Quart. J. Roy. Meteor. Soc.*, **131**, 2191–2208.

Bocquet, M., 2005c: Reconstruction of an atmospheric tracer source using the principle of maximum entropy. II: Applications. *Quart. J. Roy. Meteor. Soc.*, **131**, 2209–2223.

Bocquet, M., 2007: High resolution reconstruction of a tracer dispersion event. *Quart. J. Roy. Meteor. Soc.*, **133**, 1013–1026.

Bocquet, M., 2008: Inverse modelling of atmospheric tracers: Non-Gaussian methods and second-order sensitivity analysis. *Nonlinear Processes Geophys.*, **15**, 127–143.

Borwein, J. M., and A. S. Lewis, 2000: *Convex Analysis and Nonlinear Optimization: Theory and Examples*. Springer, 273 pp.

Buehner, M., P. L. Houtekamer, C. Charette, H. L. Mitchell, and B. He, 2010: Intercomparison of variational data assimilation and the ensemble Kalman filter for global deterministic NWP. Part II: One-month experiments with real observations. *Mon. Wea. Rev.*, **138**, 1567–1586.

Buizza, R., and A. Montani, 1999: Targeted observations using singular vectors. *J. Atmos. Sci.*, **56**, 2965–2985.

Burgers, G., P. J. van Leeuwen, and G. Evensen, 1998: Analysis scheme in the ensemble Kalman filter. *Mon. Wea. Rev.*, **126**, 1719–1724.

Cane, M. A., A. Kaplan, R. N. Miller, B. Y. Tang, E. C. Hackert, and A. J. Busalacchi, 1996: Mapping tropical Pacific sea level: Data assimilation via a reduced state space Kalman filter. *J. Geophys. Res.*, **101** (C10), 22599–22617.

Carrassi, A., A. Trevisan, and F. Uboldi, 2007: Adaptive observations and assimilation in the unstable subspace by breeding on the data-assimilation system. *Tellus*, **59A**, 101–113.

Carrassi, A., M. Ghil, A. Trevisan, and F. Uboldi, 2008: Data assimilation as a nonlinear dynamical systems problem: Stability and convergence of the prediction-assimilation system. *Chaos*, **18**, 023112.

Chapnik, B., G. Desroziers, F. Rabier, and O. Talagrand, 2006: Diagnosis and tuning of observational error in a quasi-operational data assimilation setting. *Quart. J. Roy. Meteor. Soc.*, **132**, 543–565.

Cohn, S. E., 1997: An introduction to estimation theory. *J. Meteor. Soc. Japan*, **75**, 257–288.

Cohn, S. E., A. da Silva, J. Guo, M. Sienkiewicz, and D. Lamich, 1998: Assessing the effects of data selection with the DAO physical-space statistical analysis system. *Mon. Wea. Rev.*, **126**, 2913–2926.

Courtier, P., 1997: Dual formulation of four-dimensional variational assimilation. *Quart. J. Roy. Meteor. Soc.*, **123**, 2449–2461.

Courtier, P., and O. Talagrand, 1987: Variational assimilation of meteorological observation with the adjoint vorticity equation. II: Numerical results. *Quart. J. Roy. Meteor. Soc.*, **113**, 1329–1347.

Cover, T. M., and J. A. Thomas, 1991: *Elements of Information Theory*. Wiley Series in Telecommunications, Wiley-Interscience, 542 pp.

Daescu, D. N., and I. M. Navon, 2004: Adaptive observations in the context of 4D-Var data assimilation. *Meteor. Atmos. Sci.*, **85**, 205–226.

Davoine, X., and M. Bocquet, 2007: Inverse modelling-based reconstruction of the Chernobyl source term available for long-range transport. *Atmos. Chem. Phys.*, **7**, 1549–1564.

Del Moral, P., 2004: *Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications*. Springer-Verlag, 566 pp.

Desroziers, G., L. Berre, B. Chapnik, and P. Poli, 2005: Diagnosis of observation, background and analysis-error statistics in observation space. *Quart. J. Roy. Meteor. Soc.*, **131**, 3385–3396.

Doucet, A., S. Godsill, and C. Andrieu, 2000: On sequential Monte Carlo sampling methods for Bayesian filtering. *Stat. Comput.*, **10**, 197–208.

Doucet, A., N. de Freitas, and N. Gordon, Eds., 2001: *Sequential Monte Carlo Methods in Practice*. Springer-Verlag, 612 pp.

Evensen, G., 1994: Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. *J. Geophys. Res.*, **99** (C5), 10143–10162.

Evensen, G., 1997: Advanced data assimilation for strongly nonlinear dynamics. *Mon. Wea. Rev.*, **125**, 1342–1354.

Evensen, G., 2003: The ensemble Kalman filter: Theoretical formulation and practical implementation. *Ocean Dyn.*, **53**, 343–367.

Eyink, G. L., and S. Kim, 2006: A maximum entropy method for particle filtering. *J. Stat. Phys.*, **123**, 1071–1128.

Fisher, M., M. Leutbecher, and G. A. Kelly, 2005: On the equivalence between Kalman smoothing and weak-constraint four-dimensional variational data assimilation. *Quart. J. Roy. Meteor. Soc.*, **131**, 3235–3246.

Fletcher, S. J., and M. Zupanski, 2006: A data assimilation method for log-normally distributed observational errors. *Quart. J. Roy. Meteor. Soc.*, **132**, 2505–2519.

Fujita, T., D. J. Stensrud, and D. C. Dowell, 2007: Surface data assimilation using an ensemble Kalman filter approach with initial condition and model physics uncertainties. *Mon. Wea. Rev.*, **135**, 1846–1868.

Gardiner, C. W., 2004: *Handbook of Stochastic Methods: For Physics, Chemistry and the Natural Sciences*. 3rd ed. Springer Series in Synergetics, Springer, 415 pp.

Gaspari, G., and S. E. Cohn, 1999: Construction of correlation functions in two and three dimensions. *Quart. J. Roy. Meteor. Soc.*, **125**, 723–757.

Gauthier, P., 1992: Chaos and quadri-dimensional data assimilation: A study based on the Lorenz model. *Tellus*, **44A**, 2–17.

Gilks, W. R., and C. Berzuini, 2001: Following a moving target-Monte Carlo inference for dynamic Bayesian models. *J. Roy. Stat. Soc. B*, **63**, 127–146.

Gordon, N. J., D. J. Salmond, and A. F. M. Smith, 1993: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. *IEE Proc. F*, **140**, 107–113.

Hamill, T. M., J. S. Whitaker, and C. Snyder, 2001: Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter. *Mon. Wea. Rev.*, **129**, 2776–2790.

Handschin, J., and D. Mayne, 1969: Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. *Int. J. Control*, **9**, 547–559.

Harlim, J., and B. R. Hunt, 2007: A non-Gaussian ensemble filter for assimilating infrequent noisy observations. *Tellus*, **59A**, 225–237.

Haven, K., A. Majda, and R. Abramov, 2005: Quantifying predictability through information theory: Small sample estimation in a non-Gaussian framework. *J. Comput. Phys.*, **206**, 334–362.

Heemink, A. W., M. Verlaan, and A. J. Segers, 2001: Variance reduced ensemble Kalman filtering. *Mon. Wea. Rev.*, **129**, 1718–1728.

Hólm, E., 2007: Humidity control variable and supersaturation. *Proc. Workshop on Flow-Dependent Aspects of Data Assimilation*, Reading, United Kingdom, ECMWF, 143–150.

Hólm, E., E. Andersson, A. Beljaars, P. Lopez, J-F. Mahfouf, A. J. Simmons, and J-N. Thépaut, 2002: Assimilation and modelling of the hydrological cycle: ECMWF’s status and plans. Tech. Rep. 383, ECMWF, 57 pp. [Available online at http://www.ecmwf.int/publications/library/ecpublications/_pdf/tm/301-400/tm383.pdf].

Hoteit, I., 2008: A reduced-order simulated annealing approach for four-dimensional variational data assimilation in meteorology and oceanography. *Int. J. Numer. Methods Fluids*, **58**, 1181–1199.

Hoteit, I., D-T. Pham, G. Triantafyllou, and G. Korres, 2008: A new approximate solution of the optimal nonlinear filter for data assimilation in meteorology and oceanography. *Mon. Wea. Rev.*, **136**, 317–334.

Houtekamer, P. L., and H. L. Mitchell, 1998: Data assimilation using an ensemble Kalman filter technique. *Mon. Wea. Rev.*, **126**, 796–811.

Houtekamer, P. L., and H. L. Mitchell, 2001: A sequential ensemble Kalman filter for atmospheric data assimilation. *Mon. Wea. Rev.*, **129**, 123–137.

Houtekamer, P. L., and H. L. Mitchell, 2005: Ensemble Kalman filtering. *Quart. J. Roy. Meteor. Soc.*, **131**, 3269–3289.

Houtekamer, P. L., H. L. Mitchell, and X. Deng, 2009: Model error representation in an operational ensemble Kalman filter. *Mon. Wea. Rev.*, **137**, 2126–2143.

Huber, P. J., 1973: Robust regression: Asymptotics, conjectures, and Monte Carlo. *Ann. Stat.*, **1**, 799–821.

Hunt, B. R., and Coauthors, 2004: Four-dimensional ensemble Kalman filtering. *Tellus*, **56A**, 273–277.

Hyvärinen, A., and E. Oja, 2000: Independent component analysis: Algorithms and applications. *Neural Networks*, **13**, 411–430.

Ide, K., P. Courtier, M. Ghil, and A. Lorenc, 1999: Unified notation for data assimilation: Operational, sequential and variational. *J. Meteor. Soc. Japan*, **75**, 181–189.

Jazwinski, A. H., 1970: *Stochastic Processes and Filtering Theory*. Academic Press, 376 pp.

Kalnay, E., H. Li, T. Miyoshi, S-C. Yang, and J. Ballabrera, 2007: 4D-Var or ensemble Kalman filter. *Tellus*, **59A**, 758–773.

Kitagawa, G., 1987: Non-Gaussian state-space modeling of nonstationary time series. *J. Amer. Stat. Assoc.*, **82**, 1032–1063.

Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. *J. Atmos. Sci.*, **59**, 2057–2072.

Kleeman, R., 2007: Statistical predictability in the atmosphere and other dynamical systems. *Physica D*, **230**, 65–71.

Krüger, J., 1993: Simulated annealing: A tool for data assimilation into an almost steady model state. *J. Phys. Oceanogr.*, **23**, 679–688.

Krysta, M., and M. Bocquet, 2007: Source reconstruction of an accidental radionuclide release at European scale. *Quart. J. Roy. Meteor. Soc.*, **133**, 529–544.

Kullback, S., 1959: *Information Theory and Statistics*. Wiley, 395 pp.

Laroche, S., and P. Gauthier, 1998: A validation of the incremental formulation of 4D variational data assimilation in a nonlinear barotropic flow. *Tellus*, **50A**, 557–572.

Lauvernet, C., J-M. Brankart, F. Castruccio, G. Broquet, P. Brasseur, and J. Verron, 2009: A truncated Gaussian filter for data assimilation with inequality constraints: Application to the hydrostatic stability condition in ocean models. *Ocean Modell.*, **27**, 1–17.

Lawson, W. G., and J. A. Hansen, 2004: Implications of stochastic and deterministic filters as ensemble-based data assimilation methods in varying regimes of error growth. *Mon. Wea. Rev.*, **132**, 1966–1981.

Le Dimet, F-X., and O. Talagrand, 1986: Variational algorithms for analysis and assimilation of meteorological observations: Theoretical aspects.

,*Tellus***38A****,**97–110.Le Dimet, F-X., , H-E. Ngodock, , B. Luong, , and J. Verron, 1997: Sensitivity analysis in variational data assimilation.

,*J. Meteor. Soc. Japan***75****,**245–255.Leith, C. E., 1974: Theoretical skill of Monte Carlo forecast.

,*Mon. Wea. Rev.***102****,**409–418.Lermusiaux, P. F. J., , and A. R. Robinson, 1999: Data assimilation via error subspace statistical estimation. Part I: Theory and schemes.

,*Mon. Wea. Rev.***127****,**1385–1407.Lilliefors, H. W., 1967: On the Kolmogorov–Smirnov test for normality with mean and variance unknown.

,*J. Amer. Stat. Assoc.***62****,**399–402.Lions, J-L., , O. P. Manley, , R. Temam, , and S. Wang, 1997: Physical interpretation of the attractor dimension for the primitive equations of atmospheric circulation.

,*J. Atmos. Sci.***54****,**1137–1143.