1. Introduction
a. The context
This is a review of the non-Gaussian aspects of data assimilation, in the context of geophysics. Through references and a few examples, it investigates the difficulties in producing analyses using statistical modeling that goes beyond Gaussian hypotheses or beyond second-order moment closure. The emphasis is on the concepts and promising ideas, rather than on the technicalities or the completeness of the bibliography. However, mathematical details will be given when necessary or appealing. Examples, original in some cases, will be provided using simple models.
Nonlinearity and non-Gaussianity are interlaced topics, and it is difficult to discuss only one facet of the problem while neglecting the other. For instance, nonlinearities of a dynamical model inevitably produce non-Gaussian priors to be used in later analyses. Nevertheless, the focus of this review is more on the non-Gaussian aspects from the statistical modeling viewpoint, and on ways to extend the usual Gaussian analysis of current data assimilation orthodoxy. There are reviews, or relevant reports, that focus more on the nonlinear aspects but less on modeling the non-Gaussian statistics (Miller et al. 1994; Evensen 1997; Verlaan and Heemink 2001; Andersson et al. 2005). The intended scope of the article is broad: meteorology, oceanography, and atmospheric chemistry. Non-Gaussianity may take many forms there, and does not necessarily always come from the dynamics. However, a commonality of these fields is the very large dimension of the state and observation vector spaces. At first glance, this rules out most of the sophisticated probabilistic mathematical methods that are meant to estimate the full state probability density function (pdf), or higher-order moments, or to provide a state estimate without any approximation.
b. Statistical modeling of the estimation problem
A data assimilation system consists of a set of observations and of a numerical model, which may be static or dynamical, deterministic or stochastic, and which represents the underlying physics. The mathematical modeling of uncertainty for this system implies that one embeds models and observations in a statistical framework and provides uncertainty input about them. From the uncertainty of the components of this data assimilation system, one can ultimately infer, not only an estimate of the true state, but also the uncertainty of that estimate.
This uncertainty could originate from the imprecise initial state of the system. It could also stem from the more or less precise identification of forcings of the dynamical systems, such as emission fields (in atmospheric chemistry), radiative forcing, boundary conditions, and couplings to other models that may be imperfect. The deficiency of the model itself is another source of uncertainty. To account for this type of uncertainty, models could explicitly be made probabilistic. This occurs when some stochastic forcing is implemented to represent subgrid-scale processes in Eulerian models, or when stochastic particles are simulated to represent dispersion in Lagrangian models. The uncertainty could also come from the observations in the form of representativeness or instrumental errors, or indirectly from the models and algorithms used to filter these observations through quality control. Finally, in the case of remote sensing, it could stem from the joint use of a model (a radiative transfer model for instance) and an algorithm that infers data from indirect measurements.
The proper statistical modeling depends on how uncertainty evolves under the full data assimilation system dynamics. In particular, in the context of forecasting, this modeling should properly account for the uncertainty growth–reduction cycle, which is controlled by the forecast–analysis steps of the data assimilation cycle.
Truncating statistics to the first- and second-order moments (bias and error covariance matrix) may be made necessary by the complexity of fully Bayesian data assimilation algorithms. It also avoids the computer storage of higher-order moments, which are gigantic objects in geophysical systems. This truncation may also be justified from the point of view of the evolution of the dynamical model. If, in the vicinity of a trajectory, the model can be replaced by its tangent linear approximation, then initial Gaussian statistics will remain Gaussian in this vicinity. Unfortunately, the statistics diverge from ideal Gaussianity when the model is strongly nonlinear, when the analyses are infrequent, or when the observational data are sparse (Pires et al. 1996; Evensen 1997).
c. Outline of the review of ideas
In this article, these arguments will be developed. First, the reasons why researchers in geophysical data assimilation have avoided exact Bayesian modeling, even though it may be more natural, will be reviewed (section 2).
Since a fully Bayesian approach may not be computationally affordable for geophysical or large environmental problems, Gaussian filtering and variational approaches have been developed with success in geophysical data assimilation. Then the reasons why non-Gaussianity is often bound to reemerge in the statistical modeling of well-behaved geophysical problems are detailed. Next, we will briefly describe the strategies that have been employed, especially in four-dimensional variational data assimilation (4D-Var) and the ensemble Kalman filter, to accommodate possible non-Gaussian deviations in the data assimilation system (section 3). However, this is not the main focus of this review, since the literature offers excellent discussion papers on these topics (Kalnay et al. 2007). Ways to objectively measure the deviations from Gaussianity will then be discussed (section 4).
In section 5, the following strategies to better control the uncertainty and to make it less non-Gaussian are examined: targeting of observations, specific treatment of highly nonlinear degrees of freedom, localization, and model reduction.
In section 6, we will review a selection of new ideas to make use of non-Gaussian statistical modeling in data assimilation, either perturbatively (about a Gaussian formalism), or nonperturbatively (without direct reference to a Gaussian formalism). Several original examples will serve as illustrations.
To conclude, the future of these ideas and concepts is discussed (section 7).
2. Why not non-Gaussian from the start?
A Gaussian modeling of the uncertainty in geophysical data assimilation problems may not be the most natural approach to begin with. A more natural approach would be to forget about the constraints such as the dimensionality of the state space of geophysical models. In a constraint-free context, rigorous methods have been proposed in applied mathematics to solve analytically or numerically the fully nonlinear estimation problem, either in discrete or continuous time.
a. The estimation problem
Within this probabilistic framework, one could either be interested in estimating the true state of the system, along with its uncertainty, at the present time, or be interested in estimating the true state for all times.
b. Nonlinear statistical time-continuous estimation
From an algorithmic standpoint, by making the knowledge of a dynamical system a probabilistic one, the objects to deal with are no longer vectors in ℝ^N (the estimate of a state vector). Rather, they are functions p(x) of N variables, and possibly of time in the smoothing case. This stresses the change of scale in complexity. The mathematics required to tackle these problems do exist, but their efficiency is questionable when applied to high-dimensional geophysical systems.
c. Particle filters and their curse in high-dimensional systems
When it comes to numerically solving the fully Bayesian filtering (possibly smoothing) problem, the most popular approaches are based on Monte Carlo sampling (Handschin and Mayne 1969; Kitagawa 1987). Sequential Monte Carlo methods to solve these nonlinear filtering equations are called particle filters. A quite complete and very clear review of the potential applications of particle filters to the geophysical estimation problems was offered by van Leeuwen (2009).
During the forecast step from time t_k to time t_{k+1}, the particles are propagated with the model x_{k+1} = M_{k+1}(x_k) + w_{k+1}. In the context of the bootstrap filter, the pdf is then simply updated at time t_{k+1} and satisfies Eq. (12), but at time t_{k+1}. This completes the particle filter cycle.
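For concreteness, a minimal sketch of one such cycle is given below, assuming a generic propagation routine model_step, an observation operator H, and Gaussian observation errors of standard deviation sigma (these names are ours, introduced only for illustration):

```python
import numpy as np

def bootstrap_step(particles, weights, y, model_step, H, sigma, rng):
    """One forecast-analysis cycle of the bootstrap particle filter (sketch).

    particles: (M, N) array of state vectors at time t_k
    weights:   (M,) normalized importance weights
    y:         observation vector available at time t_{k+1}
    """
    # Forecast: propagate each particle with the (possibly stochastic) model
    particles = np.array([model_step(x, rng) for x in particles])

    # Analysis: update the weights with the Gaussian observation likelihood;
    # log-likelihoods are used to limit numerical underflow
    loglik = np.array([-0.5 * np.sum((y - H(x))**2) / sigma**2 for x in particles])
    weights = weights * np.exp(loglik - loglik.max())
    weights /= weights.sum()
    return particles, weights
```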
Unfortunately, for high-dimensional systems, most of the weights vanish and just a few particles remain likely (see e.g., Berliner and Wikle 2007). Therefore, in this case, the particle filter becomes useless for the estimation problem.
The ensemble size required for a proper estimation has been shown to scale exponentially with the system size, or with the innovation variance. To prove this, Snyder et al. (2008) studied the statistics of the largest weight. They demonstrated on a simple Gaussian model that the required ensemble size scales like M ∼ exp(τ²/2), where τ² is the variance of the observation log-likelihood. Under simple independence assumptions, τ² is expected to scale with the dimension of the observation space on the one hand, and with the dimension of the state space on the other.
This behavior is related to the so-called curse of dimensionality (Bellman 1961). Consider one of the particles, a state space vector in ℝ^N. It is meant to be representative of a volume [−ε, ε]^N of state space centered on that particle. However, particles similar to that particle are close to it according to the Euclidean metric: they lie in a neighborhood, say within a distance of ε. In high-dimensional systems, a representativeness issue arises because of the shrinking of the hypersphere of radius ε within the hypercube [−ε, ε]^N. Indeed, the volume of the hypersphere relative to that of the hypercube vanishes like (π/4)^{N/2}/Γ[(N/2) + 1] with the space dimension N. The particle is less and less representative of the cubic volume it is meant to sample. In the context of data assimilation, this implies that the observational prior and the background prior overlap less and less as the state space or observation space dimensions increase. As a consequence, most of the weights of the particles vanish, leading to a poor analysis.
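This collapse of the volume ratio can be checked directly. The short computation below (an illustration added here, not taken from the references) evaluates (π/4)^{N/2}/Γ[(N/2) + 1] for a few dimensions N:

```python
from math import pi, log, lgamma, exp

def volume_ratio(N):
    """Volume of the N-ball of radius eps relative to the hypercube [-eps, eps]^N."""
    # (pi/4)^(N/2) / Gamma(N/2 + 1), evaluated in log space for numerical stability
    return exp(0.5 * N * log(pi / 4.0) - lgamma(N / 2.0 + 1.0))

for N in (1, 3, 10, 40, 80):
    print(f"N = {N:3d}: ratio = {volume_ratio(N):.3e}")
```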
To mitigate the collapse of the weights, a resampling step is often used in the bootstrap filter. Basically the idea is to draw new particles among the old ones according to the probability given by their weight. After this resampling, all the new particles have the same weight. However, it is likely that many of the new particles will be drawn from the same original particle. This will deplete the ensemble. Unless model error is already specified and of stochastic nature, it is necessary to introduce some perturbation (noise) into the forecast step, in order to enrich the ensemble.
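A minimal sketch of such a resampling step, here the residual variant used later in the illustration of section 2d, could look as follows (variable names are ours):

```python
import numpy as np

def residual_resample(weights, rng):
    """Residual resampling: return the indices of the particles to duplicate."""
    M = len(weights)
    # Deterministic part: particle i is copied floor(M * w_i) times
    counts = np.floor(M * weights).astype(int)
    indices = np.repeat(np.arange(M), counts)
    # Stochastic part: the remaining slots are drawn from the residual weights
    residual = M * weights - counts
    n_left = M - counts.sum()
    if n_left > 0:
        residual /= residual.sum()
        extra = rng.choice(M, size=n_left, p=residual)
        indices = np.concatenate([indices, extra])
    return indices  # new ensemble: particles[indices], all weights reset to 1/M
```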
However, the resampling does not fundamentally solve the issue since the weights still degenerate, possibly at a lower rate. The collapse of the weights could be avoided using Markov chain Monte Carlo (MCMC) techniques, based for instance on a Metropolis–Hastings selection algorithm (Gilks and Berzuini 2001). Contrary to importance sampling (and importance proposal sampling, which will be detailed later), these approaches have no exponential dependence on the dimensionality [see the illuminating discussion by MacKay (2003), chapter 29]. However, a considerable number of iterations would be required for an MCMC approach to sample a filtering pdf like those met in geophysical applications. That is why it is not clear to the authors whether these techniques will be decisive in building a successful particle filter for high-dimensional systems.
d. Illustrations
As an illustration of these concepts, a bootstrap filter is compared to the ensemble Kalman filter (EnKF) of Evensen (1994) on a Lorenz-95 model (Lorenz and Emanuel 1998). The comparison follows the methodology of Nakano et al. (2007). This EnKF is the original single-ensemble variant, corrected by Burgers et al. (1998). The Lorenz-95 model is implemented with the standard forcing parameter F = 8, and N = 10 variables, in a perfect model setting. In particular, no stochastic forcing is applied. The absence of a stochastic term does not prevent a reliable assessment of the data assimilation system over long enough runs, thanks to the ergodicity of the dynamical system.
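For reference, a possible implementation of the Lorenz-95 tendencies and of a fourth-order Runge–Kutta step is sketched below (the integration details of the actual experiments may differ):

```python
import numpy as np

def lorenz95_tendency(x, F=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F, with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.05, F=8.0):
    """One fourth-order Runge-Kutta step of length dt."""
    k1 = lorenz95_tendency(x, F)
    k2 = lorenz95_tendency(x + 0.5 * dt * k1, F)
    k3 = lorenz95_tendency(x + 0.5 * dt * k2, F)
    k4 = lorenz95_tendency(x + dt * k3, F)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
```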
One of every two sites is observed. The time interval between two observations is Δt = 0.05, which corresponds to a geophysical time of 6 h (Lorenz and Emanuel 1998). The synthetic measurements are perturbed with a normal white noise of root-mean-square σ = 1.5. The root-mean-square observation error, χ, for the EnKF is also chosen to be 1.5. This places the filter in optimal conditions since the prior observation error statistics coincide with the true observation error statistics. Moreover, in order to reduce the undersampling errors in covariances for small ensemble sizes, two versions of the EnKF are implemented, with or without localization (Houtekamer and Mitchell 1998). Localization is carried out thanks to a tapering function (Houtekamer and Mitchell 2001; Hamill et al. 2001) of the form given by Eq. (4.10) of Gaspari and Cohn (1999), with an optimally tuned localization length. For the localization length, as well as other parameters in the following, optimal values were selected via a sensitivity analysis to minimize the analysis error.
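As a reminder, the compactly supported fifth-order taper of Gaspari and Cohn (1999), Eq. (4.10), can be coded as follows (a sketch; the form quoted here is the one commonly used in the EnKF literature, with c the localization length and a support of [0, 2c]):

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Fifth-order piecewise rational taper of Gaspari and Cohn (1999), Eq. (4.10).

    dist: array of non-negative separations; c: localization length.
    """
    r = np.atleast_1d(np.abs(dist)) / c
    taper = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r < 2.0)
    taper[inner] = (-0.25 * r[inner]**5 + 0.5 * r[inner]**4 + 0.625 * r[inner]**3
                    - (5.0 / 3.0) * r[inner]**2 + 1.0)
    taper[outer] = ((1.0 / 12.0) * r[outer]**5 - 0.5 * r[outer]**4 + 0.625 * r[outer]**3
                    + (5.0 / 3.0) * r[outer]**2 - 5.0 * r[outer] + 4.0 - 2.0 / (3.0 * r[outer]))
    return taper
```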
The particle filter to which the EnKF will be compared is the bootstrap filter described above. It is tested on the same setup and uses the same parameters, when applicable, as the ensemble Kalman filter. Contrary to the EnKF, no localization is used in the particle filter (this issue will be addressed later). For the particle filter, the prior observation error standard deviation χ is allowed to differ from the observation perturbation standard deviation σ = 1.5, since it is optimally tuned. For most of the ensemble size range M, the optimal values are close to χ = 2.5. This implicit error inflation is meant to account indirectly for the sampling error in the representation of the pdf [Eq. (12)]. However, for increasing values of M, the optimal χ tends to σ (not shown).
In addition, Gaussian perturbations, white in space and time, are added to all particle state vectors with an amplitude that is optimally tuned for the sake of a fair comparison with the EnKF. Again, the optimal values are obtained thanks to a sensitivity study carried out with varying noise magnitude. Such noise is also necessary in the EnKF case, at least for small ensemble sizes, since no inflation was implemented and because a single-ensemble configuration of the EnKF is used (Mitchell and Houtekamer 2009). A (residual) resampling is carried out after each analysis. The two schemes are compared on the analysis root-mean-square error (analysis rms error).
Results are shown in Fig. 1. Still, the particle filter requires about 10^4 members to match the EnKF performance. The size of the system, N = 10, was chosen so that the EnKF/bootstrap filter crossover could be observed with a reasonable computational load.
The collapse of the particle filter with increasing state space dimension can be illustrated on the same Lorenz-95 model. Four configurations are chosen identical to the one described above, except for the system size: N = 10, 20, 40, and 80. Figure 2 displays the empirical statistics of the maximum weight in the four cases. In the first two cases, the density is rather balanced, with high values of the weight maxima that are not overrepresented. In the last two cases, the weights degenerate. When N = 80, the mode is near 1 and the particle filter collapses (it is of no use for estimation).
Therefore, at least with basic (though not naive) algorithms, it is still unreasonable to use particle filters for the state estimation, let alone high-order moments of the errors.
e. Gaussian as a makeshift?
Admittedly, a fully non-Gaussian solution to the estimation problem may still be an intractable problem for large geophysical systems. The first nontrivial approximation to the statistical estimation problem is to truncate the error statistics to second order or to assume these statistics are approximately Gaussian. In this case, the multivariate pdf p(x), with x ∈ ℝ^N, can be derived solely from the Gaussian correlations between pairs of variables, which are functions of two variables p(x_k, x_l), with 1 ≤ k, l ≤ N. These are equivalent to the full error covariance matrix. Hence, Gaussian estimation still leads to complex objects to deal with, before any reduction. It is computationally tractable only if the full covariance matrix is sampled or reduced.
Gaussian statistics have appealing properties. They remain analytically tractable in multivariate form (e.g., under convolution and integration). Their occurrence and recurrence in physical systems are supported by the central limit theorem. Moreover, Gaussians are the simplest distributions (in the sense of information theory) when only first- and second-order moments are known, as recalled by Rodgers (2000).
3. Dealing with non-Gaussianity in a Gaussian framework
This section explains the current way of dealing with nonlinearity and non-Gaussianity: reasonable data assimilation should consider non-Gaussian effects as corrections to a Gaussian analysis-based strategy. Variational approaches (essentially 4D-Var) and ensemble-based Kalman filters incorporate different techniques to account for model nonlinearities.
Sources of non-Gaussianity can be initially categorized into two families: nonlinearities in models and non-Gaussianity of priors. The latter will be emphasized here since it is the main focus of this review, but also because several recently developed methodologies are available. One has to keep in mind that this classification is somewhat arbitrary since nonlinear dynamical models inevitably produce non-Gaussian error statistics, which are often used as background error statistics.
a. Non-Gaussianity from nonlinearities in models
Non-Gaussianity results from nonlinearities in models because, under a nonlinear model transition, a Gaussian pdf becomes non-Gaussian. Nonlinearities in models may come from the nonlinearity of the Navier–Stokes equations leading to chaos, thresholds in microphysics (cloud, ice, rain), the chemistry of atmospheric compounds (including thermodynamics and aerosol size evolution equations), increases in model resolution that require finer physical schemes such as for precipitation at the convective scale, the nonlinearity of observation operators (especially in remote sensing applications: lidar and satellite), and so on. Many of these sources of nonlinearity have been discussed by Andersson et al. (2005).
b. Non-Gaussianity from priors
1) Prior modeling of state space or control space variables
Non-Gaussian priors are sometimes more adequate descriptions of the background. This is especially the case for positive variables with large deviations about their mean, which occur in many geophysical fields. Humidity in meteorology, species concentrations and emission inventories in atmospheric chemistry, algae populations in ocean biogeochemistry, and ice and gas age in paleoglaciology ice cores are just a few examples. In the case of atmospheric chemistry, it is usually considered that the typical errors in emission inventories are of the order of 40% before any assimilation, which rules out Gaussian modeling for these positive variables. Lognormal distributions are often used instead.
Pires et al. (2010) take the example of brightness temperature from the High Resolution Infrared Sounder (HIRS) channels to show that non-Gaussianity may also stem from the variability in the specified standard deviations of background errors (a statistical property called heteroscedasticity), in particular when aggregate statistics are used in data assimilation systems.
2) Observation error prior
For the sake of clarity, the time index k is dropped here. Rather than the additive noise model y_i = H_i(x) + υ_i, the observation equation could become y_i = s_i H_i(x), with s_i a strictly positive multiplicative dimensionless factor. In that case, s_i represents a relative error.
Now that possible sources of non-Gaussianity have been recalled, classic solutions to deal with them are reviewed.
c. 4D-Var solutions to deal with nonlinearity
The 4D-Var algorithm was originally proposed to solve the data assimilation problem, described as a constrained optimization problem, using classic descent algorithms. The gradient of the cost function to be minimized can be efficiently computed by optimal control techniques (Le Dimet and Talagrand 1986; Talagrand and Courtier 1987; Courtier and Talagrand 1987).
When the model and the observation operators are linear or weakly nonlinear (in the case of small model time steps and of short assimilation time intervals), these Gaussian assumptions may hold reasonably well. However, when the assimilation windows are long enough or the models are strongly nonlinear, the Gaussian assumptions will certainly break down and the conditional pdf could become multimodal. As a result, the maximum likelihood estimates become less informative (Lorenc and Payne 2007). This is equivalent to the existence of multiple minima of the 4D-Var cost function (Gauthier 1992; Miller et al. 1994; Pires et al. 1996), whereas the deterministic numerical optimization finds only one local minimum.
To find the global minimum of
One feasible remedy is to deal with the original nonlinear optimization problem approximately, through a succession of inner-loop quadratic optimization problems in which the model is simplified (at a lower resolution with simpler physics) and linearized (Laroche and Gauthier 1998). The input of one inner loop is generated by relinearizing the original nonlinear model around the state adjusted by the output of the previous inner loop. This inner/outer approach may fail when significant inner-loop linearization errors arise, for high-resolution models and long assimilation windows in a perfect-model context (strong-constraint 4D-Var; Trémolet 2004). This difficulty can be alleviated by introducing model error (weak-constraint 4D-Var). Indeed, the propagation of information within the assimilation window by the tangent linear model is shortened as compared to strong-constraint 4D-Var, thanks to the model error present at each time step (Andersson et al. 2005; Fisher et al. 2005; Trémolet 2006).
Another approach, quasi-static variational assimilation, was proposed by Pires et al. (1996) for the assimilation of dense observations. The global minimum was guaranteed by progressively lengthening the assimilation period, thus always keeping the first guesses under control when using a gradient descent method for the cost-function minimization.
d. EnKF solutions to deal with nonlinearity
To close the algorithm, the a priori state
Reports about the EnKF can be found in meteorology, oceanography, hydrology, and several other fields (Evensen 2003). The reasons for this popularity are multifold. First, although many environmental systems are nonlinear and high dimensional, there often exists a low-dimensional subspace (local and global attractors) that represents the complete dynamics reasonably well (Lions et al. 1997; Patil et al. 2001). Thus, the pdf may be represented by a proper ensemble with a limited number of members. Second, it is well known that an ensemble forecast has an advantage over a single control forecast (Leith 1974). Finally, the Gaussian assumption at analysis time may be suitable for many scenarios (e.g., the Gaussian background errors for global numerical weather prediction as in Andersson et al. 2005).
For large systems in meteorology, the most effective EnKF schemes are those that localize the background error covariance (Houtekamer and Mitchell 2001; Hamill et al. 2001) so that spurious correlations at long distances are reduced. In other fields such as oceanography and air quality, a related approach using reduced-rank Kalman filters (Cane et al. 1996; Heemink et al. 2001; Pham et al. 1998) has also been tested. Such filters work only in subspaces of the complete error space (Lermusiaux and Robinson 1999; Nerger et al. 2005). The EnKF can itself be viewed as a reduced-rank Kalman filter, since the error covariance matrices are approximated by the ensemble statistics in a square root form (Tippett et al. 2003). The analysis in Eq. (20) has two implementations: a deterministic scheme (Whitaker and Hamill 2002), or a stochastic scheme with perturbed observations for consistent error statistics (Burgers et al. 1998). They differ from each other in their handling of non-Gaussianity (Lawson and Hansen 2004).
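As an illustration of the stochastic flavor, a bare-bones analysis step with perturbed observations, in the spirit of Burgers et al. (1998), might read as follows (a sketch assuming a linear observation operator given as a matrix H; variable names are ours):

```python
import numpy as np

def enkf_analysis(E, y, H, R, rng):
    """Stochastic EnKF analysis with perturbed observations (sketch).

    E: (N, M) ensemble of forecast states; y: (d,) observation vector
    H: (d, N) linear observation operator; R: (d, d) observation error covariance
    """
    N, M = E.shape
    # Ensemble anomalies and sample background covariance mapped to observation space
    A = E - E.mean(axis=1, keepdims=True)
    HA = H @ A
    Pf_Ht = A @ HA.T / (M - 1)          # P_f H^T
    S = HA @ HA.T / (M - 1) + R         # H P_f H^T + R
    K = Pf_Ht @ np.linalg.inv(S)        # Kalman gain
    # Perturb the observations so that the analysis error statistics are consistent
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=M).T
    return E + K @ (Y - H @ E)
```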
Improvements to the EnKF are essentially driven by the design of better sampling strategies for the ensemble generation: the second-order exact resampling (Pham 2001), the unscented sampling (Van der Merwe et al. 2000), and the mean-preserving sampling (Sakov and Oke 2008). Increasingly, model deficiencies are simulated using ensemble members generated with different versions of the underlying forecast model (e.g., with different physical parameterizations; Meng and Zhang 2007; Fujita et al. 2007; Houtekamer et al. 2009; or with perturbations of model parameters; Wu et al. 2008).
Another idea is to bridge the gap between variational and sequential approaches and to improve the EnKF performance (Kalnay et al. 2007) using ideas and techniques developed for 4D-Var. Examples of such attempts include: the inner–outer loop to deal with nonlinearities (Kalnay et al. 2007), the variational formulation to treat non-Gaussian error structures in the observations (Zupanski 2005) and in the background (Harlim and Hunt 2007), and the time interpolation of the background forecasts to the observations so as to produce time-coherent assimilations of all the observations available within the assimilation window (Hunt et al. 2004; Houtekamer and Mitchell 2005). Alternatively, 3D-Var or 4D-Var can use flow-dependent error covariances computed from EnKF ensembles (Buehner et al. 2010), which leads to more efficient hybrid algorithms.
4. Measuring non-Gaussianity
In a Gaussian framework, one needs tools to assess the deviation from Gaussianity mainly induced by nonlinearities of the model: objective mathematical measures or statistical tests. These tools will be reviewed in this section, and, moreover, some of them will be used in section 6.
a. Relative entropy
The measure has been used in geophysical, high-dimensional applications, in predictability (Kleeman 2002), in the statistical modeling of geophysical dynamical systems (Haven et al. 2005), in inverse modeling (Bocquet 2005b,c), and in the modeling of prior pdfs (Eyink and Kim 2006; Pires et al. 2010).
The Kullback–Leibler divergence can serve as an objective function to measure the deviation from Gaussianity. If p is the full pdf of the uncertainty for the system, and q ≡ p_G is the Gaussian pdf that has the same first- and second-order moments, then the relevant measure is K(p, p_G) = ∫ p(x) ln[p(x)/p_G(x)] dx, which in this case is also known as the negentropy of p.
The pdf p could be estimated approximately by the use of an ensemble, such as the one used by ensemble-based filters. It is, however, difficult to perform such estimation for high-dimensional systems, especially with a small ensemble. Besides, the presence of a strange attractor complicates the numerical convergence and a proper definition of a continuous limit. Several solutions have been proposed to overcome the difficulty: compute relative entropies of marginals of p or compute expansions of the relative entropy.
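As an elementary example of the latter strategy, a crude moment-based approximation of the negentropy of a single standardized marginal, often written J ≈ s²/12 + k²/48 with s the skewness and k the excess kurtosis, can be evaluated directly from an ensemble. The sketch below uses our own notation and is only meaningful for weak non-Gaussianity:

```python
import numpy as np

def marginal_negentropy(sample):
    """Moment-based (Edgeworth-type) approximation of the negentropy of a 1D sample:
    J ~ skewness^2 / 12 + (excess kurtosis)^2 / 48, valid for weak non-Gaussianity."""
    x = np.asarray(sample, dtype=float)
    x = (x - x.mean()) / x.std()
    skew = np.mean(x**3)
    exkurt = np.mean(x**4) - 3.0
    return skew**2 / 12.0 + exkurt**2 / 48.0

# Usage sketch: negentropy of each marginal of an ensemble E of shape (M, N)
# neg = [marginal_negentropy(E[:, i]) for i in range(E.shape[1])]
```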
Figures 3 and 4 illustrate the departure from Gaussianity, and ways to measure it, on a deterministic Lorenz-63 model (Lorenz 1963), where the negentropy can be estimated numerically. A Gaussian pdf sampled by particles, initially of covariance matrix 𝗣 = diag(σ², σ², σ²), with σ = 0.20, is transformed under the model flow. The full negentropy is estimated via a numerical integration. As explained by Kleeman (2002), relative entropy must be estimated at a fixed resolution, possibly fine enough to encapsulate the attractor [a spacing of about r = Δx = Δy = Δz ≃ 0.1 has been chosen to discretize the integral in Eq. (23)]. The Edgeworth expansion, and its estimates based on one- and two-variable marginals, are computed as well. The result of the Edgeworth expansion cannot be directly compared to the full relative entropy estimate at finite resolution. Indeed, the ensemble members tend to gather close to, or on, the attractor, which makes their distribution more and more singular as the flow evolves. For the sake of comparison, the ensemble must therefore be smoothed by a normal law (of variance 0.5 here), yielding a finite-resolution value. Obviously, after time t = 0.5, the cluster of particles, stretched by the flow, loses its cohesion, and the pdf becomes significantly non-Gaussian. This is confirmed by the indicators based on the negentropy, as well as by the Edgeworth expansion.
Each of these indicators reports the deviation from Gaussianity, though not with the correct magnitude. As numerical estimations of the Kullback–Leibler divergence, all these approximations therefore remain unsatisfactory.
b. Univariate and multivariate tests of normality
Because such computations cannot easily be generalized to high-dimensional, complex dynamical systems, one could rely on simpler necessary tests of normality. Hypothesis testing is a well-developed topic in statistics, and many techniques meant to test the Gaussianity of random variables exist.
Among the many tests of normality available in the statistical literature, the skewness and kurtosis, which are directly defined by the cumulants of a distribution, were used very early on. Lawson and Hansen (2004) have used them to assess how differently stochastic and deterministic ensemble-based filters handle non-Gaussianity. There are many other tests, such as the Kolmogorov–Smirnov test (Lilliefors 1967), the Anderson–Darling test (Anderson and Darling 1952), the Shapiro–Wilk test (Shapiro and Wilk 1965), and their many variants.
A few generalizations of these tests to multivariate statistics do exist, but they are not meant to handle system sizes as large as those in geophysics. One is therefore compelled to use necessary though insufficient tests. Examples have been given earlier with the use of marginal distributions in the computation of the negentropy. One could also rely on combinations of variables in the system, such as sums (or sums of squares) of individual degrees of freedom, which are supposed to be close to Gaussian. Assuming a Gaussian distribution for the elementary degrees of freedom, it may be possible to compute the distribution of these consolidated random variables. Then univariate null-hypothesis statistical tests, such as those mentioned earlier, can be used in turn.
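For illustration, such univariate tests applied to one marginal of an ensemble could be gathered as follows (a sketch relying on standard scipy.stats routines; the selection of tests is ours):

```python
import numpy as np
from scipy import stats

def normality_pvalues(sample):
    """p-values of a few univariate tests of normality applied to a 1D sample.
    Standardizing with the sample mean/std makes the K-S test a Lilliefors-type test."""
    sample = np.asarray(sample, dtype=float)
    z = (sample - sample.mean()) / sample.std()
    return {
        "skewness test": stats.skewtest(sample).pvalue,
        "kurtosis test": stats.kurtosistest(sample).pvalue,
        "Shapiro-Wilk": stats.shapiro(sample).pvalue,
        "Kolmogorov-Smirnov": stats.kstest(z, "norm").pvalue,
    }
```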
5. Reducing nonlinearity’s impact: Divide and conquer
With a denser monitoring network or more frequent observations, the model should remain closer to its tangent linear trajectory between analyses. Therefore, non-Gaussianity will not develop as much as in a system less constrained by observations. Nonetheless, if, with an increasing number of observations, the model resolution is increased as well and subgrid processes are explicitly represented at a finer scale, new sources of nonlinearity and non-Gaussianity might appear, as discussed in section 3. In the context of meteorology, the finer the horizontal scale, the larger the error growth rate (Lorenz 1969; Tribbia and Baumhefner 2004; Lorenc and Payne 2007). As a consequence, how non-Gaussian the errors are may well depend on their scale. Increasing both the space and time resolution and the observation density may lead to a data assimilation system in which significantly non-Gaussian error statistics appear at the convective scale while synoptic-scale errors become smaller and more Gaussian.
In this section, following this paradigm but without going as far as adopting a broad multiscale view on non-Gaussianity, we review some of the ideas put forward to reduce non-Gaussianity and nonlinearity, so that classic data assimilation based on Gaussian hypotheses could become more efficient.
One idea consists in using targeted (also called adaptive) observations in order to obtain a better control. A second one consists in dividing the system between degrees of freedom that are more or less prone to nonlinearities, and hence require more or less accounting for non-Gaussian effects. Another idea consists in representing non-Gaussian features, such as multimodality, via a sum of individual Gaussian components.
a. Better control with adaptive observations
The analysis improvement and the reduction of its computational cost can be obtained by an assimilation that adapts to the properties of the dynamical flow, in particular its instability. For instance, Pires et al. (1996) have shown that the efficient variational assimilation length τ_eff(x) is proportional to λ^{−1}(x), where λ(x) is the leading local Lyapunov exponent at x. From Eq. (3.15) of Pires et al. (1996), and relying on their simplifying assumptions [i.e., perfect model, frequent and regular observations within the assimilation window, and λ(x)Δt(x) ≪ 1], the leading analysis error variance e²(x) is constrained by e²(x) ≤ 2λ(x)Δt(x)σ², where observations of variance σ² are obtained every Δt(x) (much shorter than the assimilation window). Thus, smaller observation intervals are required for the more unstable cases.
Adaptive techniques in data assimilation also call for the deployment of targeted observations (TOs), pioneered by the singular vectors approach (Buizza and Montani 1999). Then, Daescu and Navon (2004) use the adjoint sensitivity approach, evaluating the sensitivity function ‖∇J‖, the norm of the gradient of a forecasting error functional J. Local maxima of this sensitivity function in the physical space determine the set
The number of tracking observations necessary to stabilize a sequential prediction–assimilation system and to track the unstable flow depends on the number and magnitude of the system's m positive Lyapunov exponents (Carrassi et al. 2008). The efficient monitoring of the m-dimensional unstable space E is achieved through the blending of a fixed observational network with updated TOs. Efficient analyses with fixed observations are obtained through assimilation in the unstable subspace (Carrassi et al. 2007), where the analysis increment is confined to the updated unstable subspace E, obtained by breeding on the data assimilation system.
b. Bayesian filtering in reduced-rank system or subsystem
Following the divide and conquer strategy, the use of exact Bayesian techniques, such as particle filters, could be restricted to the significantly non-Gaussian degrees of freedom of the geophysical system. For instance, Lagrangian assimilation of data from oceanographic drifters is highly non-Gaussian since the positions of the drifters need to be controlled too. Spiller et al. (2008) have successfully tested several particle-filtering strategies on such drifters in a flow generated by point vortices. Berliner and Wikle (2007) and Hoteit et al. (2008) explore theoretically the use of particle filters, but on identified low-dimensional manifolds of the dynamics, or on a reduced-order model of a large geophysical system [e.g., through empirical orthogonal functions (EOFs)].
c. Localizing strategies for particle filters
In the context of the fully Bayesian estimation problem, non-Gaussian uncertainty could be reduced by localization of the analyses. Indeed, the smaller the area, the smaller the number of degrees of freedom to handle, and the less complex (e.g., multimodal) the local pdf of these degrees of freedom should be. As a consequence, the smaller the area, the lower the number of particles necessary for a given precision of the estimate. However, contrary to localization in the EnKF, the local analyses cannot simply be glued together to obtain a complete set of updated global particles. That is why localization was not used in the particle filter of the illustrations of section 2. This issue has been discussed at length by van Leeuwen (2009).
d. Gaussian mixtures
With a view to particle filtering, one could replace each particle of the ensemble by a broader Gaussian kernel. Unfortunately, for high-dimensional systems, this kernel representation also suffers from the curse of dimensionality (Silverman 1986). Recently proposed remedies essentially amount to filtering in the low-dimensional subspace related to the attractor of the complete system, as mentioned in the previous section. This can be implemented either by a localization and smoothing procedure (Bengtsson et al. 2003), or thanks to a low-rank representation of the error covariance matrix (Hoteit et al. 2008) inherited from reduced-rank Kalman filters. Note that in Bengtsson et al. (2003), the error covariance matrices associated with the kernels are not identical but are generated by a Kalman filter for each sample x_i. Nevertheless, Gaussian mixture models are distinct from the unscented particle filter (Van der Merwe et al. 2000), where the sequential Monte Carlo sampling is performed according to locally linearized importance sampling functions given by the posterior pdf of a Kalman filter for each of the particles (see section 6).
6. Bridging the gap between Gaussian and non-Gaussian data assimilation
There have been recent attempts to make use of non-Gaussian ideas in geophysical (or geophysically inspired) data assimilation. They remain quite specific in their application, because of their underlying hypotheses. They are, nevertheless, promising and a discussion on their relevance to geophysical data assimilation is presented. Contrary to section 3 where non-Gaussian errors were described and modeled mathematically, the emphasis here is on producing the analyses that cope with those non-Gaussian errors.
a. Statistical expansion about the climatology
In an ensemble-based filtering system, the estimates of the first- and second-order moments rely on the ensemble itself. Instead of forming a Gaussian pdf as a prior for the analysis using these statistics, one computes the pdf that is closest to the climatology and that has the same first- and second-order moments, as shown by Eyink and Kim (2006).
This work has also been put forward by van Leeuwen (2009) in his review on particle filters. However, we do not consider the method to be a particle filter, since it involves the truncation of the moments to second order, and the most innovative part is the treatment of the prior. Though the idea is very appealing, it remains to be proven that an attractor of the dynamics can be described analytically or numerically so that this information can be used in the method. Eyink and Kim (2006) tested their method on the Lorenz-63 model, using a mixture of two Gaussians to describe the two lobes of the attractor. The results show that the method eventually outperforms the EnKF, but in a regime where the filter is very nonlinear. This occurs when the time interval between two analyses reaches about Δt = ⅔, possibly when the climatology starts having a significant impact on the filter trajectories. This might not reflect realistic conditions, since the time interval between two analyses in weather forecasting would rather correspond to Δt = 0.05.
b. Gaussian anamorphosis
One way to treat non-Gaussianities is to attempt to transform, analytically or numerically, non-Gaussian random variables into Gaussian ones, on which a BLUE-based analysis can appropriately be carried out.
1) Analytical transformation
The variational version of this analysis, essentially based on Eq. (17), was examined by Fletcher and Zupanski (2006), including thorough discussions on how to choose a proper estimator and how to precondition the minimization of a cost function such as Eq. (17). Mapping the lognormal errors to a Gaussian space is a particular (analytical) case of a Gaussian anamorphosis.
2) Numerical transformations
When an analytical transformation to Gaussian space is not possible because the errors do not necessarily follow a lognormal behavior, then numerical methods can be used to achieve a similar goal. This is called a (numerical) Gaussian anamorphosis. This technique is well known in geostatistics (Wackernagel 2003). Its use has been advocated by Bertino et al. (2003) in the context of geophysical data assimilation (see also the next section). The idea of performing the analysis in the Gaussian space is the same as that of Cohn (1997), but for a general, albeit numerical transformation.
In principle, a Gaussian anamorphosis is needed in both state space and observation space, then analysis equations similar to Eqs. (35) and (36) can be applied in Gaussian space. An inverse Gaussian anamorphosis is then built to pull the analyzed fields back into the original space.
It has recently been implemented with success on a large ocean and biogeochemical model by Simon and Bertino (2009) in a twin experiment. The transformation was applied to a chlorophyll field. Applying this methodology to such a large-scale experiment is not simple. As a first reasonable step, the authors neglected the correlations and considered climatological univariate distributions of the non-Gaussian variables, for which the anamorphosis is well defined and simple to implement. To take the full correlations into account, one would need to consider a multivariate anamorphosis. With multivariate statistics, one would have to rotate the state space to obtain uncorrelated variables, by principal component analysis or independent component analysis (Hyvärinen and Oja 2000), and then apply a Gaussian anamorphosis to each of the marginals. However, a non-Gaussian part of the mutual information (in other words, residual correlations) remains in the rotated space (Pires and Perdigão 2007).
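A minimal univariate version of such a numerical anamorphosis, built from the empirical cumulative distribution function of a sample and the standard normal quantile function, is sketched below (the interpolation choices are ours):

```python
import numpy as np
from scipy import stats

def make_anamorphosis(sample):
    """Build a univariate Gaussian anamorphosis and its inverse from a sample.

    The forward map sends a value x to Phi^{-1}(F(x)), where F is the empirical
    cdf of the sample and Phi the standard normal cdf."""
    xs = np.sort(np.asarray(sample, dtype=float))
    # Plotting positions strictly inside (0, 1) to keep Phi^{-1} finite
    probs = (np.arange(1, len(xs) + 1) - 0.5) / len(xs)
    gauss = stats.norm.ppf(probs)

    def forward(x):
        return np.interp(x, xs, gauss)

    def backward(z):
        return np.interp(z, gauss, xs)

    return forward, backward

# Usage sketch: transform the variables, perform the (BLUE-based) analysis in the
# Gaussian space, then map the analyzed fields back with the inverse anamorphosis.
```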
3) Humidity transform in meteorological models
4) Gaussian analyses under linear inequality constraints
Some additional constraints may render a Gaussian data assimilation scheme non-Gaussian. This may happen when the prior forces the control variables to lie in a polytope (to satisfy linear inequalities), or when an observation error prior must account for outliers (as in section 2). If the unconstrained priors are Gaussian, then the constrained priors are truncated Gaussian priors. Remarkably, several Gaussian data assimilation schemes can be extended to the truncated Gaussian case in a mathematically rigorous way, with limited complications.
In variational data assimilation, Lagrangian duality (Borwein and Lewis 2000) can be used to lift these constraints, either on the observational errors (as made explicit in section 2), or in the state background errors (Bocquet 2008), through the use of Lagrange multipliers. The transformation is essentially exact if the cost functions are convex, a requirement that may not be satisfied if the models are nonlinear.
In ensemble-based Kalman filtering, filters can be extended to deal with linear inequalities. Assume the background is a truncated Gaussian, whereas the observation errors are normal. Then the analysis, as seen from Bayes's formula, yields the product of a truncated Gaussian by a Gaussian, which is in turn a truncated Gaussian. Besides, the analysis uses the same set of operators (such as the Kalman gain) as in the unconstrained case. This makes the use of such a scheme very practical. The major change comes from the need to sample from a truncated Gaussian, which is not straightforward for high-dimensional problems. The truncated Kalman filter was developed by Lauvernet et al. (2009) in a geophysical context and successfully tested on a one-dimensional mixed layer ocean model.
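In the univariate case, sampling from a truncated Gaussian is readily available; in several dimensions, a naive fallback is rejection sampling against the linear inequality constraints, which quickly becomes inefficient when the constraints remove most of the Gaussian mass. Both are sketched below (the bounds and constraint matrices are hypothetical placeholders, not those of the cited works):

```python
import numpy as np
from scipy import stats

def sample_truncated_gaussian(mean, std, lower, upper, size, rng):
    """Draw samples from a univariate Gaussian truncated to [lower, upper]."""
    a, b = (lower - mean) / std, (upper - mean) / std   # bounds in standard units
    return stats.truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=rng)

def rejection_sample(mean, cov, A, c, size, rng):
    """Naive multivariate rejection sampling under linear constraints A x <= c."""
    out = []
    while len(out) < size:
        x = rng.multivariate_normal(mean, cov)
        if np.all(A @ x <= c):
            out.append(x)
    return np.array(out)
```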
c. Using non-Gaussian deviations in the priors to improve analysis
Given the sampling statistics of the innovations, it is possible to compute the mean and covariance matrix of the innovation vector. This is useful to correct error biases and to tune the prescribed error covariance matrices (Desroziers et al. 2005). Beyond those statistics, innovation histograms and higher-order moments of the innovations can also be computed, for instance measures of non-Gaussianity like the skewness s_d and the kurtosis k_d. Pires et al. (2010, manuscript submitted to Physica D) have computed diagnostics of s_d and k_d for the quality-controlled ECMWF innovations of brightness temperatures of a set of HIRS channels. Their results emphasized the statistically significant non-Gaussianity of the errors in several channels. They estimate a joint non-Gaussian prior pdf for the observation errors ϵ_o and background errors ϵ_b in observation space, using the maximum entropy on the mean (MEM) method. The method follows the same principle as the one used and exemplified in sections 6a and 6e. The output of the method is a pdf of ϵ_o and ϵ_b compatible with the prescribed innovation statistics. Moreover, it is minimally committed in the sense that, from the information theory point of view, it is the simplest pdf (with minimal extra information) that explains the prescribed statistics. This prior modeling can be shown to be beneficial to subsequent analyses, which then go beyond the BLUE result.
d. Particle filtering with Gaussian filters as importance proposals
We come back to the ideas of the particle filter. We will see how Gaussian analyses can help to numerically solve the Bayesian estimation problem. The ideas are fairly recent in the geophysical community and the extrapolation to complex systems is speculative.
1) Concept of importance sampling
In our opinion, the same importance sampling principle can be used to justify two attempts to improve particle filtering from the geophysicist's point of view. Xiong et al. (2006) make use of a Gaussian resampling based on the particles' first- and second-order moments. It was shown to improve the forecast ability of the particle filter, often beating the EnKF, in the case of the Lorenz-63 model. The merging particle filter of Nakano et al. (2007), in a similar spirit, uses a Gaussian resampling (matching the first- and second-order moments) to enrich the sampling of the particle filter. The authors demonstrate a significant improvement with the Lorenz-63 and Lorenz-95 models, but the particle filter still requires many more particles than the EnKF, even on these toy models. Although it is not stated in those words, these papers illustrate the use of Gaussian hypotheses in rigorously non-Gaussian estimation, through the use of importance sampling. However, to our knowledge, the correction to the weights that would be necessary to guarantee the proper Bayesian asymptotics was not computed.
Now we come back to the problem of the collapse of the particle filter. The basic idea, fostered in the applied mathematics community and advocated by van Leeuwen (2009) in geophysics, is that in order to avoid too unlikely trajectories, particles should be drawn at time t_{k−1} from a proposal making use of y_k. For high-dimensional applications and complex models, this is certainly not trivial to implement. The following section gives clues, and original numerical examples, based on this idea.
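Schematically, if the particles are drawn from a proposal q(x_k | x_{k−1}, y_k) rather than from the model transition density, the importance weights must be corrected as w_k ∝ w_{k−1} p(y_k | x_k) p(x_k | x_{k−1}) / q(x_k | x_{k−1}, y_k). A sketch of this bookkeeping, with the densities passed as user-supplied callables (our own construction), is given below:

```python
import numpy as np

def proposal_weight_update(particles_prev, particles_new, weights_prev, y,
                           log_likelihood, log_transition, log_proposal):
    """Weight update for a particle filter with an observation-dependent proposal.

    Each log_* argument is a callable returning the log-density of the
    corresponding factor; keeping all three terms preserves the Bayesian
    asymptotics of the filter."""
    logw = np.log(weights_prev)
    for i, (x_prev, x_new) in enumerate(zip(particles_prev, particles_new)):
        logw[i] += (log_likelihood(y, x_new)
                    + log_transition(x_new, x_prev)
                    - log_proposal(x_new, x_prev, y))
    w = np.exp(logw - logw.max())
    return w / w.sum()
```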
2) Current-observation-dependent proposal with Gaussian analyses
Two words of caution are in order about the WEnKF. First, the particles are interacting (following the terminology of Del Moral 2004), not only through the standard resampling, but also through the estimation of the covariance matrix. Second, being a Gaussian pdf, the proposal function has rapidly vanishing tails. Therefore, on the one hand, the WEnKF should be very efficient, compared with other particle filters, in the regime where the EnKF outperforms particle filters. On the other hand, it may be weaker in a regime where simpler particle filters outperform the EnKF. Thus, we believe that the overall benefit of the WEnKF, however appealing, remains debatable.
e. Nonperturbative non-Gaussian methods for high-dimensional linear models
There are relevant geophysical cases where the models are approximately linear, but where the priors may be intrinsically different from Gaussians. This is the case for tracer transport, radionuclide dispersion, dust, several greenhouse gases including CO2, and so on. In that context (i.e., linear models and non-Gaussian priors), and unlike in the previous examples, a non-Gaussian analysis can be performed thoroughly and without approximation, using nonlinear convex analysis (see Borwein and Lewis 2000). The theory is based on nonquadratic cost functions that generalize 4D-Var and the Physical-space Statistical Analysis System (PSAS) in the specific linear model case (Bocquet 2005b,c, 2007; Krysta and Bocquet 2007; Bocquet 2008).
1) Bayesian inference and maximum a posteriori
However, such a dual equivalence is possible only if
2) Maximum entropy on the mean inference
An application to the forecast of an accidental plume of pollutant is given in the context of the European Tracer Experiment (Nodop et al. 1998) in Fig. 7. About 10^3 observations are used for 2 × 10^4 control variables. The analyses are hence conveniently carried out in observation space. The plume contours obtained by the MEM method are much finer than those obtained with 4D-Var, which is of utmost importance for such a dispersion event, especially in the regions where the concentration field exhibits strong gradients.
The strictly Bayesian solution of the previous section is different from the MEM solution: the exponential pdf in Eq. (63) is generally not the posterior pdf obtained from Bayes's rule. Instead, the MEM method convexifies the objective function that would be obtained from Bayes's rule. Therefore, if the existence of multiple minima matters in the problem, the MEM approach may differ significantly from the strictly Bayesian solution. However, the analysis of multiple minima in geophysical applications is rather speculative. They would likely occur when considering a strongly constrained 4D-Var with a sufficiently long assimilation window. Rather than facing such a multiple-minima optimization problem, it is tempting to convexify the objective function anyway by, for instance, incorporating model error (weakly constrained 4D-Var). The problem with convexification is the arbitrariness of the regularization of the penalty functions, which can result in an unlikely solution. However, for pollutant source reconstruction problems, Bocquet (2008) has shown that the difference between the two approaches is small.
So far, this discussion was relevant for the state estimation problem. For second-order and higher-order moments, the MEM method would not give the correct estimates even in the Gaussian prior case. To circumvent this drawback of the MEM method, Bocquet (2008) showed that correctly defined moments of the MEM inference with a prior ν could be obtained from the strictly Bayesian inference but using the prior exp [−(ν̂)*]. Within this extension of the MEM method, a precise correspondence is defined between the two approaches.
3) Second-order sensitivity analysis
This non-Gaussian second-order analysis is illustrated with an original experiment on the inversion of the Chernobyl accident radionuclides source term. The physics is essentially linear so that the methodology applies without approximation. The positivity of the source term requires a non-Gaussian background error prior modeling. Figure 8 illustrates several sensitivities in both Gaussian and non-Gaussian analysis cases. The global marginal gain of information ∂y
4) Score
7. Perspectives
The theoretical and numerical Bayesian solutions to the estimation problem have been shown to be not only appealing but also quite natural. Yet, the computational complexity that prevents their use ultimately explains the success of 4D-Var and the ensemble Kalman filter. However, the non-Gaussianity generated by the nonlinearities of the model, or intrinsic to the priors, has not vanished. The ever-increasing computing power and widespread parallel architectures, even on cheap systems, make the use of more sophisticated applied mathematical solutions tempting. Nevertheless, the complexity of the estimation with these solutions does not scale reasonably (e.g., linearly) with the high dimensionality and complexity of geophysical systems, and relying on computing power alone is insufficient.
In this review, it has been shown that one can diagnose non-Gaussianity in data assimilation with sophisticated tools and concepts inherited from statistics and information theory. More importantly, several examples of solutions were given, with promising performances. A few were of a perturbative nature, using an expansion about the Gaussian analysis system (weakly non-Gaussian prior construction). A few were of a nonperturbative nature with little approximation (e.g., the maximum entropy filter and Gaussian anamorphosis). Others were of a fully nonperturbative nature, taking advantage of Gaussian analysis guidance or of some linearity in the models (e.g., the optimal importance function particle filter, the weighted ensemble Kalman filter, and maximum entropy variational inference). These examples all remain specific, either because they rely on an assumption that is difficult to generalize, or because they have only been tested on relatively low-dimensional systems so far. However, they can already be used on highly nonlinear subsystems of larger geophysical systems, such as Lagrangian drifters in a flow, or a submanifold of the dynamics. Alternatively, they can be used in real applications that do possess simplifying features, such as model linearity.
Increasing computing power might not only serve advanced data assimilation techniques, but also allow one to process more observations (denser coverage in space and time) and to use finer model resolution. As a consequence, models may become locally more and more linear, and errors more and more Gaussian, between analyses. However, this argument is partly refuted by the nonlinearity of small-scale physics, in conjunction with the fundamentally multiscale nature of geophysical systems. That is why we believe the proper handling of non-Gaussianity will remain an important issue.
We think that more general solutions for high-dimensional systems than those exemplified in this review will require the simultaneous reduction of the model’s dynamical degrees of freedom, spatial dividing–localization strategies, and an efficient sampling strategy (in connection with model error characterization). Particle filters or variants that are more advanced will then eventually be useful.
The number of possibilities for building new solutions is tremendous, especially for filtering. It is to be expected that more and more applications will make use of an increasing number of approaches mixing sequential and variational techniques, or combining Gaussian analyses with fully Bayesian ones. We expect that comparing all these methodologies and contexts will be a (very) difficult task. Theoretical and general guidance will then be needed to sort them all out, with both mathematical analysis and the use of high-dimensional geophysical benchmarking models.
Acknowledgments
M. Bocquet is grateful to E. Kalnay and L. Fillion, organizers of the WWRP/THORPEX workshop on “4D-VAR and Ensemble Kalman Filter Intercomparisons,” held in Buenos Aires, Argentina, November 2008. The paper follows the overview of the session “Issues of Nonlinearity and non-Gaussianity” presented at this workshop. The authors are indebted to an anonymous reviewer, C. Snyder, and H. L. Mitchell for their substantial and thorough suggestions that helped to improve the manuscript. M. Bocquet acknowledges stimulating discussions with P. J. van Leeuwen and O. Pannekoucke. Finally, the authors thank M. Krysta and L. Delle Monache for their careful reading of the manuscript and for their useful comments.
REFERENCES
Anderson, J. L., and S. L. Anderson, 1999: A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Mon. Wea. Rev., 127 , 2741–2758.
Anderson, T. W., and D. A. Darling, 1952: Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Stat., 23 , 193–212.
Andersson, E., and H. Järvinen, 1999: Variational quality control. Quart. J. Roy. Meteor. Soc., 125 , 697–722.
Andersson, E., M. Fisher, E. Hólm, L. Isaksen, G. Radnóti, and Y. Trémolet, 2005: Will the 4D-Var approach be defeated by nonlinearity? Tech. Rep. 479, ECMWF, 28 pp. [Available online at http://www.ecmwf.int/publications/library/ecpublications/_pdf/tm/401-500/tm479.pdf].
Andersson, E., and Coauthors, 2007: Analysis and forecast impact of the main humidity observing systems. Quart. J. Roy. Meteor. Soc., 133 , 1473–1485.
Auroux, D., 2007: Generalization of the dual variational data assimilation algorithm to a nonlinear layered quasi-geostrophic ocean model. Inverse Probl., 23 , 2485–2503.
Barndorff-Nielsen, O. E., and D. R. Cox, 1989: Asymptotic Techniques for Use in Statistics. Monogr. Stat. Appl. Probab., No. 31, Chapman & Hall, 252 pp.
Bellman, R., 1961: Adaptive Control Processes: A Guided Tour. Princeton University Press, 255 pp.
Bengtsson, T., C. Snyder, and D. Nychka, 2003: Toward a nonlinear ensemble filter for high-dimensional systems. J. Geophys. Res., 108 , 8775. doi:10.1029/2002JD002900.
Berliner, M. L., and C. K. Wikle, 2007: Approximate importance sampling Monte Carlo for data assimilation. Physica D, 230 , 37–49.
Bertino, L., G. Evensen, and H. Wackernagel, 2003: Sequential data assimilation techniques in oceanography. Int. Stat. Rev., 71 , 223–241.
Bishop, C. H., B. J. Etherton, and S. J. Majumdar, 2001: Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Mon. Wea. Rev., 129 , 420–436.
Bocquet, M., 2005a: Grid resolution dependence in the reconstruction of an atmospheric tracer source. Nonlinear Processes Geophys., 12 , 219–234.
Bocquet, M., 2005b: Reconstruction of an atmospheric tracer source using the principle of maximum entropy. I: Theory. Quart. J. Roy. Meteor. Soc., 131 , 2191–2208.
Bocquet, M., 2005c: Reconstruction of an atmospheric tracer source using the principle of maximum entropy. II: Applications. Quart. J. Roy. Meteor. Soc., 131 , 2209–2223.
Bocquet, M., 2007: High resolution reconstruction of a tracer dispersion event. Quart. J. Roy. Meteor. Soc., 133 , 1013–1026.
Bocquet, M., 2008: Inverse modelling of atmospheric tracers: Non-Gaussian methods and second-order sensitivity analysis. Nonlinear Processes Geophys., 15 , 127–143.
Borwein, J. M., and A. S. Lewis, 2000: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 273 pp.
Buehner, M., P. L. Houtekamer, C. Charette, H. L. Mitchell, and B. He, 2010: Intercomparison of variational data assimilation and the ensemble Kalman filter for global deterministic NWP. Part II: One-month experiments with real observations. Mon. Wea. Rev., 138 , 1567–1586.
Buizza, R., and A. Montani, 1999: Targeted observations using singular vectors. J. Atmos. Sci., 56 , 2965–2985.
Burgers, G., P. J. van Leeuwen, and G. Evensen, 1998: Analysis scheme in the ensemble Kalman filter. Mon. Wea. Rev., 126 , 1719–1724.
Cane, M. A., A. Kaplan, R. N. Miller, B. Y. Tang, E. C. Hackert, and A. J. Busalacchi, 1996: Mapping tropical Pacific sea level: Data assimilation via a reduced state space Kalman filter. J. Geophys. Res., 101 , (C10). 22599–22617.
Carrassi, A., A. Trevisan, and F. Uboldi, 2007: Adaptive observations and assimilation in the unstable subspace by breeding on the data-assimilation system. Tellus, 59A , 101–113.
Carrassi, A., M. Ghil, A. Trevisan, and F. Uboldi, 2008: Data assimilation as a nonlinear dynamical systems problem: Stability and convergence of the prediction-assimilation system. Chaos, 18 , 023112.
Chapnik, B., G. Desroziers, F. Rabier, and O. Talagrand, 2006: Diagnosis and tuning of observational error in a quasi-operational data assimilation setting. Quart. J. Roy. Meteor. Soc., 132 , 543–565.
Cohn, S. E., 1997: An introduction to estimation theory. J. Meteor. Soc. Japan, 75 , 257–288.