## Abstract

In data assimilation (DA) schemes for numerical weather prediction (NWP) systems, the estimation of forecast error covariances is a key point to get some flow dependency. As shown in previous studies, ensemble data assimilation methods are the most accurate for this task. However, their huge computational cost raises a strong limitation to the ensemble size. Consequently, covariances estimated with small ensembles are affected by random sampling errors. The aim of this study is to develop a theory of covariance filtering in order to remove most of the sampling noise while keeping the signal of interest and then to use it in the DA scheme of a real NWP system. This first part of a two-part study presents the theoretical aspects of such criteria for optimal filtering based on the merging of the theories of optimal linear filtering and of sample centered moments estimation. Its strength relies on the use of sample estimated quantities and filter output only. These criteria pave the way for new algorithms and interesting applications for NWP. Two of them are detailed here: spatial filtering of variances and covariance localization. Results obtained in an idealized 1D analytical framework are shown for illustration. Applications on real forecast error covariances deduced from ensembles at convective scale are discussed in a companion paper.

## 1. Introduction

Data assimilation (DA) is a corpus of methods whose goal is to define a state vector called the “analysis” as close as possible of the “true” state, using several information inputs and their probability distributions. In the current schemes used for operational numerical weather prediction (NWP), there are two main sources of information: a prior state vector called “background”—usually a short term forecast—and observations. The respective weights of these sources in the analysis retrieval are deduced from their error covariance matrices, that is, the second-order centered moments of their distributions.

However, because of the huge size of state vectors in operational NWP systems ( elements), the background error covariance matrix (hereafter denoted ) cannot be easily estimated, and not even stored (Dee 2005). To deal with this issue, the first option is to model the square root of as a succession of invertible operators that can be applied at a low cost on the control vector [see Bannister (2008a,b) for a detailed review]. These operators are generally calibrated using ensemble of forecast differences (e.g., Belo Pereira and Berre 2006). This method has been thoroughly used in most variational DA schemes. Its strength comes from the high rank and the smoothness of the induced . However, the successive operators are often climatological, even if some flow dependency can be added through nonlinear balance operators and specific representations of the spatial transforms to allow a certain degree of heterogeneity and anisotropy (Fisher 2003; Purser et al. 2003; Deckmyn and Berre 2005; Michel 2013). As a consequence, such a formulation is clearly suboptimal for meteorological phenomena that are underrepresented in the ensemble because of their low probability, as, for instance, precipitations (Caron and Fillion 2010; Montmerle and Berre 2010) or fog (Ménétrier and Montmerle 2011). Strong geographical and seasonal dependencies have also been shown, especially for limited-area models (Michel and Auligné 2010; Monteiro and Berre 2010; Brousseau et al. 2012).

The second option is to perform an ensemble of perturbed forecasts, whose sample dispersion matches the forecast error covariance. This method has arisen from the ensemble Kalman filter (EnKF) scheme (Evensen 1994; Houtekamer and Mitchell 1998) and aims at producing flow-dependent covariances. This can be beneficial compared to the climatological formulation, provided an efficient filtering of the random sampling noise due to the limited ensemble size. As a matter of fact, the computational cost of running such an ensemble is very high, especially at the convective scale, and only a low number of members are affordable in operations (Ménétrier et al. 2014, hereafter M2014). Long-range noise on covariances is generally removed with a localization, which consists in a Schur product (i.e., element by element) with a localization matrix. This procedure can be seen as a special case of linear filtering and will hereafter be called Schur filtering (Houtekamer and Mitchell 2001; Buehner and Charron 2007). However, the definition of the localization matrix is still very empirical (Anderson and Lei 2013) and often suboptimal. The optimal localization is known to depend on the ensemble size, the variable, and the vertical and horizontal location, which are often poorly taken into account because of a lack of theory (Lorenc 2003).

New DA schemes try to merge variational and ensemble approaches to gain from their respective benefits. A first direction is to add some flow dependency in the successive operators of the square root model by estimating their parameters from an ensemble. This is done operationally for variances at Météo-France (Raynaud et al. 2009) and at the European Centre for Medium-Range Weather Forecasts (ECMWF; Bonavita et al. 2012), but also recently for spatial correlations (Varella et al. 2011). In both cases, a filtering of the parameters is required before any use in to remove the sampling noise. Another direction is to use the full ensemble covariances in an additional term of the variational cost function (EnVar-like schemes), making it hybrid (Hamill and Snyder 2000; Lorenc 2003; Buehner 2005). As for EnKF-like methods, such approaches require a localization of covariances through a Schur filtering.

The linear filtering of sample covariances is thus a key point of state-of-the-art DA schemes. This paper aims at developing a unified theory to get an optimal linear filtering of covariances, in the sense that the Euclidean norm of the filtering error is minimized. The two direct applications of such a theory presented in this paper are variance filtering and covariance localization. References to existing methods are presented and discussed in section 2, in order to explain links, differences, and motivations with respect to the new methods presented in the paper. The outline of the paper is also presented at the end of the section.

## 2. Motivations

### a. Variance filtering

Ensemble-based estimates of background error variances are affected by sampling noise, and there is experimental evidence that this sampling noise is of smaller scale than the signal of interest both with global models (Berre et al. 2007; Bonavita et al. 2012) and at convective scale (M2014). This has motivated the development and application of spatial filtering techniques, which aim at removing small-scale sampling noise while preserving the large-scale signal as much as possible. This variance filtering approach is operationally used for global data assimilation systems at Météo-France and at ECMWF, and it is currently based on a static, horizontally homogeneous and isotropic filter, which can thus be estimated and specified in spectral space.

In this framework, it is possible to derive an expression for estimating an optimal spectral filter, which is based on the minimization of the mean square error (MSE) of the filtered variance estimation. This optimal spectral filter is a simple function of a noise-to-signal ratio for each spectral component, which can be expressed as function of power spectra of noise and of signal (Berre and Desroziers 2010). This amounts to estimating two quantities in spectral space, which are averaged isotropically.

First, the power spectrum of sampling noise can be estimated either from the difference of variances provided by two independent ensembles (Berre et al. 2007; Bonavita et al. 2012) or from an expression as function of perturbation covariances (Raynaud et al. 2009), which is a result of the Wishart theory (Wishart 1928; Muirhead 2005) and which is given by Eq. (12) in the next section. Second, the power spectrum of the signal can be estimated from the difference between the power spectrum of sample variances and the power spectrum of sampling noise. The computation of associated spectral noise-to-signal ratios leads typically to a low-pass filter, in accordance with the different spatial structures of noise and signal.

These methods can also be seen as being closely related to the optimal linear filtering theory [see Eq. (18a) in section 4] and to the Wiener–Khintchine theorem (Wiener 1930; Khintchine 1934; Chatfield 2003). This theorem states that under some stationarity conditions, the autocovariance of a random process has a spectral decomposition given by the power spectrum of that process.

While these methods have proved to be efficient, some issues may be raised regarding several aspects. In particular, the calibration of the spectral filter is itself affected by sampling noise, in particular for the first small wavenumbers (which are representative of large-scale features) as discussed in Berre and Desroziers (2010). A first possible approach for this issue is to estimate the spectral shape of the filter from a temporal set of sample variance realizations (Berre et al. 2007). A second possible method is to fit an analytical spectral filter formula, whose length scale can be estimated from the raw estimated spectral filter (Raynaud et al. 2009). A third possible approach is to employ more than two independent ensembles [e.g., Bonavita et al. (2012) use five independent ensembles] in order to increase the size of the statistical sample that is used to estimate the optimal filter spectrum.

On the one hand, it has been shown by Bonavita et al. (2012) that this third approach is better than the second one, in the sense that this allows the spectral shape of the filter to be estimated more accurately than with a prespecified analytical filter. On the other hand, while this is feasible in offline mode for calibrating a static filter (note that currently operational applications at Météo-France and at ECMWF use a static filter), this is likely to be too costly if one wishes to employ an online calibration.

In the current paper, we will explore another possible approach for calibrating the filter, which is based on the specification of an analytical low-pass filter formula, whose length scale can be adjusted, and on the derivation of optimality criteria based on raw sample variances (filter inputs) and filtered variances (filter outputs). The focus of this paper is on the mathematical derivation of optimality criteria and on their evaluation in an idealized context, and Ménétrier et al. (2015, hereafter Part II) focuses on the experimentation of this approach for a convective-scale NWP model. A detailed comparison with existing calibration methods is not conducted in this paper [except for a preliminary comparison with Raynaud et al. (2009) in Part II], although this may be considered in future studies.

This new approach is motivated first by the wish to possibly perform an online calibration of the filter for the Applications of Research to Operations at Mesoscale (AROME) convective-scale NWP model. As shown in many studies [e.g., Brousseau et al. (2012) for the AROME model], forecast error covariances of a limited-area model at convective scale are indeed strongly flow dependent, and the optimal filter may change consequently depending on the considered date and season.

A second motivation is the wish to ensure that filtered variances are always positive, which is not guaranteed by previous variance-filtering methods. Negative variances might appear locally for both spectral filtering (Raynaud et al. 2009) when the filtering function has negative lobes and wavelet filtering (Bonavita et al. 2012). This issue may be more serious in areas of strong gradients that are likely to occur at convective scale (see M2014).

To ensure that filtered variances remain positive, we think that one of the simplest methods is to use a positive parameterized filter. To optimize its parameters, we would need some optimality criteria that could be easily computed from available data: raw variances, filtered variances, and any other field derived from the ensemble. As it will be shown in section 5b, the way we found to determine such criteria is to develop some results of the linear filtering theory and to merge them with known statistical properties of sample covariances.

Finally, while a heterogeneous and non-Gaussian approach can be used in the previous techniques, these different options will also be considered in the proposed method based on a prespecified filter and optimality criteria. This will be examined in Part II of this two-part study, in particular using the AROME model, with a novel evaluation of non-Gaussian effects and the experimentation of a heterogeneous approach using tiles.

### b. Covariance localization

Several methods to compute adaptive covariance localizations have been developed, but they often present significant drawbacks.

Most of them have been developed in an EnKF framework, where the covariances are computed between model and observation spaces, and where observations are assimilated sequentially (Anderson and Lei 2013; Bishop and Hodyss 2009a,b). These methods do not always work in EnVar-like schemes where the localization is required in model space.

Some are making strong assumptions about the localization function. For instance, Bishop and Hodyss (2007) assume that the localization function can be taken as a power of the background error correlation function.

Most methods are tuning the localization in order to optimize the analysis. As a consequence, the determination of the optimal localization is either costly [e.g., requiring several ensembles as in Anderson (2007)] or complex to implement [e.g., using OSSE as in Anderson and Lei (2013)].

As well as for variance filtering, we would like to get a cheap method to objectively diagnose localization functions from the ensemble only. We have found that it is possible if we change our target: we do not intend to optimize the analysis anymore, but we are trying to minimize the difference between the localized covariance matrix and the asymptotic covariance matrix that would be obtained with an infinite size ensemble. This new aim allows an analytical development to get an optimality criteria for the Schur filtering and, finally, an explicit formulation of the localization function. Similarly to the case of variance filtering, finding this optimal localization function requires the merging of results from the Schur filtering theory with known statistical properties of sample covariances.

### c. Common framework

As mentioned earlier, variance filtering and covariance localization are considered in this paper as two different ways of filtering sample covariances. The theory of centered moments estimation is required for both applications and will be detailed in the next section. Also needed, the theory of optimal linear filtering and the special case of optimal Schur filtering will be presented together in section 4. In section 5, the main results are given, with the definition of an optimality criteria for the linear and Schur filtering of sample covariances. Sections 6 and 7 propose two methods to apply this criterion regarding background error variance filtering and background error covariance localization. Illustrations are given in an idealized framework in section 8, before concluding remarks. Figure 1 illustrates the way to proceed in this article, from existing theories to new optimality criteria, and finally to applications. Practical implementations with an operational convective-scale NWP model are detailed in Part II.

## 3. Elements of the centered moments estimation theory

In this section, we recall how the second- and fourth-order moments of a random vector can be estimated from ensembles and how some of their sampling statistics can be derived.

### a. Sampling errors

Let be a random vector of size *n* and its centered counterpart:

where denotes the mathematical expectation. In NWP, these quantities are often referred to as “background” and “background error”, respectively, assuming that the background error is unbiased. The elements of the and matrices, corresponding to the true second- and fourth-order centered moments of the distribution of , are given by

and

respectively, where the indices refer to a vector element.

Assuming that we have a sample of *N* forecasts, denoted , the corresponding perturbations are given by

where the angle brackets denote the sample mean:

and

where the index *p* refers to an ensemble member.

The asymptotic value of the covariance matrix and of the fourth-order centered moments for an infinite sample size are denoted by and :

and

As shown in M2014, the dynamics of in an ensemble data assimilation system closely follows the dynamics of the true covariance error matrix . However, because of misspecifications in the sample generation, it is possible that the asymptotic values and do not converge to the true values and . In this paper, we will not deal with these systematic errors and , but we will instead focus on the sample errors and induced by the limited sample size.

### b. Statistics of sample centered moments

From multivariate sample theory (Muirhead 2005), we know that the sample covariances computed from Eq. (5a) are unbiased

From an experimental point of a view, Eq. (7) corresponds to a simple fact: if an ensemble of size is split into distinct subsets of size *N*, and if estimations of are computed for each subset, then the average of over the subsets converges to when goes to infinity.

Using a covariance estimator rigorously equivalent to Eq. (5a) given by

it is possible to show, without any further assumptions, that the covariance of sample covariances is given as a function of the asymptotic second- and fourth-order centered moments:

The demonstration is not complicated but is tedious and will not be given here for sake of clarity. It is based on the count of involved combinations for three kinds of terms depending on the cardinal of the intersection , which can take the values 0, 1, or 2.

If the sample distribution is Gaussian, we can express the asymptotic fourth-order centered moment as a function of asymptotic covariances , through the Wick–Isserlis theorem (Isserlis 1916; Vignat 2012):

Then, Eq. (9) can be simplified in

This result can also be derived from the Wishart theory (Wishart 1928; Muirhead 2005), which gives any moment of the distribution of in the case of a Gaussian sample distribution. If we consider the covariance of sample variances only, we find from Eq. (11) that

which was proved in Raynaud et al. (2009) through a direct calculation.

The sample fourth-order centered moment computed from Eq. (5b) is biased, but its expectation can be expressed as a function of asymptotic second- and fourth-order moments:

This result is an extension to a multivariate case of Dodge and Rousson (1999) results. If the sample distribution is Gaussian, the Wick–Isserlis theorem of Eq. (10) allows a simplification of Eq. (13):

## 4. Optimal linear filtering theory

The second mathematical basics that are essential to understand the criteria exposed in the next section are given here. The optimal linear filtering theory is recalled at first. It is then shown how statistics of the true signal can be expressed using only input and output of the filter.

### a. Optimality definition

In linear filtering theory, the “truth” is considered as a random vector of size *n*, from which we have a noisy estimation only (Wiener 1949). It should be noted that in the previous section, the truth was considered as a fixed and known value, which is not the case here. Hereafter, the expectation is thus taken over the whole distribution of truth and noise. We define the linearly filtered estimate as

where is an matrix called the “filter gain” and is a vector of size *n* called the “filter offset.” The filtering error is defined as

The filter gain and offset are chosen to minimize the expectation of the squared norm of the filtering error . At optimality, the partial derivatives of this quantity with respect to every coefficient of and should be zero. Doing that, we can get the following vectorial relationships at optimality:

and

which mean that the filtering error is unbiased and orthogonal to the noisy estimation. Note that the orthogonality is equivalent to the absence of correlation in the unbiased case. Equation (17b) can be linked to a well-known property of the best linear unbiased estimator (BLUE) in the data assimilation framework: the orthogonality between the analysis error (here the filtering error ) and the analysis (here the filtered estimation ). Since the filtered estimation is here linearly related to the noisy estimation through Eq. (15), the orthogonality also applies between the filtering error and the noisy estimation . It is then straightforward to derive an explicit expression of the filter gain and offset:

and

where the meaning of is extended to vectors and : , where defines the autocovariance of with itself.

### b. Assumptions on the sampling error

The sampling error is defined as

We assume that this error is unbiased and orthogonal to the truth (which amounts to a zero correlation assumption):

and

Thus, we get the following simplifications:

and

### c. Properties at optimality

Since the filter is linear, the filtering error is orthogonal to the filtered estimate:

Thus,

so the autocovariances of the filtering error can be expanded as

Successively using the orthogonality properties in Eqs. (20b) and (17b), the expectation of a product of elements of the truth can be expressed as the expectation of a product of elements of the noisy and filtered estimates:

This relationship will be important in the remainder of the paper, in the sense that it indicates that statistics such as , where is unknown, may be estimated from statistics related to , where the noisy estimate is known (it is an input), and where the optimally filtered can be assumed to be either approached or exactly known if optimality is exactly reached.

### d. Schur filtering case

The Schur product of two vectors, denoted by the “” symbol, is an element-by-element product. For instance, the elements of a vector are given by . The Schur filtering of the noisy estimation is thus performed through a Schur product:

which is therefore equivalent to a linear filtering with a diagonal filter gain [elements of the vector being the diagonal elements of the filter gain from Eq. (18a)] and no filter offset [ in Eq. (18b)]. From Eqs. (16) and (21a), we notice that this filter is biased:

where **1** is a vector of unit elements. In a similar manner to section 4a, the partial derivatives of should be set to zero at optimality of the Schur filter. The resulting expression is that at optimality:

Using Eq. (26) to extract the coefficients of vector , it is straightforward to transform Eq. (28) into

If the sampling error is unbiased and not correlated with the truth [assumptions (20a) and (20b)], then the properties (22), (24), and (25) of an optimal linear filter can be transposed for an optimal Schur filter, respectively:

and

## 5. Criteria of optimal linear filtering of covariances

In this section, the theories of centered moments estimation and of optimal linear filtering are merged in order to find optimality criteria for both the linear and the Schur filtering. Such criteria are directly expressed with sample estimated quantities and filter output.

### a. Random variable with random moments

On the one hand, in the centered moments estimation theory, the asymptotic moments and are not considered as random variables, but as “given” data. On the other hand, in the linear filtering theory of section 4, the truth is necessarily a random variable. However, the two approaches can be combined by considering that and are unique realizations of a hypothetical random process. The generation of sample centered moments and is then a two-step random process:

A random process generates the asymptotic centered moments and .

A random process generates the sample , whose statistical properties are consistent with the realization of and obtained after the first random process . The sample centered moments and are then estimated from this sample.

Since they have very different natures, it seems reasonable to assume that these random processes are independent:

This assumption is more general and encompasses Eq. (20b). It is then possible to take both processes into account in the sample centered moments statistics. Hereafter, the expectation symbol thus applies to both processes together. Using results shown in appendix A, and with replaced by , replaced by , and replaced by , Eqs. (7), (9), (11), (13), and (14) have to be transformed since and are now random variables, to give, respectively,

and

or if the sample distribution is Gaussian,

For the sample fourth-order centered moment,

or if the sample distribution is Gaussian,

The difference between Eqs. (7) and (32) can be understood from an experimental point of view as if each one of the subsets (each of size *N*) had a different asymptotic covariance . For each subset, converges toward its own realization of the process if *N* goes to infinity. For a finite *N*, the average of over the subsets converges toward when goes to infinity.

### b. Theorem for linear filtering

Let and be the sample covariances and fourth-order centered moment, estimated through Eqs. (5a) and (5b), with a sample size . Let be obtained from a linear filtering of , and let be the associated filtering error. If assumptions (20a), (20b), and (31) are verified, then the linear filter optimality conditions (17a) and (17b) are equivalent to an optimality criterion:

where

The method to get this optimality criterion is to start from the covariance of sample covariances [Eq. (33)] and to replace all the asymptotic quantities using orthogonality properties resulting from the linear filtering optimality. The full demonstration of this theorem is given in appendix B.

In the case of a Gaussian sample distribution, Eqs. (25) and (36) simplify the optimality criterion (37) in

where

### c. Theorem for Schur filtering

For the Schur filter, we have the same set of equations as for the linear filter, but indices are now changed into , and products like cannot be transformed since the orthogonality property of Eq. (28) does not apply for cross products. The special treatment of this term that involves asymptotic values will be detailed in section 7.

Assuming that the same assumptions as in the previous theorem are still verified, then the Schur filter optimality condition (28) is equivalent to an optimality criterion:

where

In the case of a Gaussian sample distribution, Eqs. (30c) and (36) simplify the optimality criterion (41) in

where

## 6. Application to the linear filtering of variances

In this section, the criteria found above are applied to the linear filtering of variances. For sake of simplicity, the diagonal coefficients of covariance matrices , , and are gathered into variance vectors , , and , and the vector is such as .

### a. Simplification of the linear filtering criterion

Combining and from Eq. (38), it is possible to get rid of filtered covariances and get a new quantity involving filtered variances only:

since . If the sample distribution is Gaussian, Eq. (45) can be simplified in

If only sample variances are available for the filtering (and not sample covariances), the criterion (37) can be reduced to a subcase that is necessary but not sufficient to ensure filtering optimality:

where is derived from Eq. (45) with :

and similarly, if the sample distribution is Gaussian, criterion (39) can be reduced to

where is derived from Eq. (46) with :

### b. Homogeneous and isotropic filtering

Criteria (47) and (49) are fully general, since no assumption has been made on the practical way of computing them. However, it is then necessary to make an assumption on the way to estimate expectations from available data (often a single realization) in order to get usable formulations. Various ergodicity assumptions could be made based on the location, the scale, or the coordinate of any base that can decompose the raw variances. Assuming that expectations are estimated through *m* distinct averages in a certain base and that the filter depends on parameters, at least one set of the filter parameters for which all the *m* estimated criteria are verified should exist, if we want to have an optimal parameterization of the filter.

In this section, we choose to perform the filtering through a homogeneous convolution with an isotropic kernel, which is parameterized with one parameter only (i.e., ): its length scale . Such a length scale has to be 1) large enough to significantly remove the sampling noise and 2) small enough to keep the signal of interest (i.e., the asymptotic variances). Therefore, a spatial ergodicity assumption is made in order to estimate expectations in criteria (47) and (49) with a spatial average (i.e., ).

In the homogeneous case, criterion (47) is transformed in

where is derived from Eq. (48) for a spatial average :

and similarly, if the sample distribution is Gaussian, criterion (49) is transformed in

where is derived from Eq. (50) for a spatial average :

In appendix C, we show that if the spectral coefficients of the filter are monotonically decreasing when increases, which is the case for usual low-pass filtering functions (e.g., Gaussian and Lorentzian), then and are monotonically increasing functions of . As a consequence, criteria (51) or (53) are verified for a unique value of , and this value can be found iteratively. For example, the following pseudocode can be used to determine the filtering length scale:

Choose a normalized low-pass filter

*F*(i.e., preserving the mean ) and a large enough initial filtering length scale .Compute , and if necessary.

For a certain number of iterations:

Filter the sample variances with to get .

Compute and then or .

If or is positive (negative), then decrease (increase) the length scale to half its value.

In practice, about 10 iterations are enough to get a satisfactory precision on . For instance, with a basic dichotomy algorithm, if the maximum filtering length scale is estimated at 500 km, then the starting length scale is set to 250 km, and after 10 iterations the obtained accuracy is km. This is likely to be satisfactory in practice, even if more iterations could be done until the machine precision is reached.

From a practical point of view, Eq. (54) used in the criterion (53) can be understood as follows: whereas the first average is determined by the sample and remains the same during the filtering length scale optimization process, the second average is progressively decreasing as the filtering length scale increases, since the variations of around the horizontal mean decrease. When the quantity reaches 0, the ratio between the first and the second averages is , meaning that the sampling noise has been damped enough, while the asymptotic variances signal should not be damped further.

It should be noted that we are working with scalar criteria and to optimize filters depending on one parameter only, but that these criteria are approximations of the necessary but not sufficient optimality criteria and [given by Eqs. (48) and (50), respectively]. The full criteria and [given by Eqs. (45) and (46), respectively], which includes elements, should ideally be used to find the full optimal filter gain [Eq. (18a)], whose size is also . However, only approximated filters are available for high-dimension systems such as NWP models. It is difficult to assess the theoretical impact of these approximations, but the performances of approximated criteria and can be tested analytically, as done in section 8, or against a reliable reference, as done in Part II.

## 7. Application to covariance localization

A second application of the theory exposed in section 5 is the covariance localization deduced from an optimal Schur filtering. It should be noted that in the whole paper, localization are denoted with capital *L*, whereas length scales are still denoted with calligraphic .

### a. Localization with a Schur product

The localization of a sample covariance matrix is performed through a Schur product with a matrix , to get a filtered matrix :

The Schur product theorem (Schur 1911) ensures that the Schur product of two positive semidefinite matrices is also positive semidefinite. Thus, if is symmetric positive semidefinite matrix, so is . With this formulation, a square root of the filtered covariance matrix is easily available (Lorenc 2003; Buehner 2005):

where is the segment of an extended control variable corresponding to .

To compute the optimal localization matrix, it is possible to use results from theorems (41) and (43) about Schur filtering. However, there is no simple way in the previous theory to ensure that the optimal localization matrix is positive semidefinite. Thus, following formulae might lead to localization matrices with negative eigenvalues, the use of which can be damaging in data assimilation algorithms. Therefore, a fit of the diagnosed localization functions with positive semidefinite functions will be generally required before any further application.

Then, injecting this relation in Eq. (42) provides an explicit expression for the localization coefficient :

and if the sample distribution is Gaussian, Eqs. (44) can be rewritten in

### b. Diagnostic of localization using filtered variances

The expectation of the product of asymptotic variances can be estimated if variances are linearly filtered before the localization step, and if this filtering is optimal. Indeed, Eq. (25) gives at optimality of the linear filtering of variances

As drawbacks of this method, we notice that variances have to be computed and filtered before diagnosing the localization, and in a quasi-optimal way for Eq. (60) to be valid.

### c. Diagnostic of localization using sample variances

However, it is not even necessary to compute and use the filtered variances. Indeed, criteria and of Eqs. (45) and (46), respectively, can be set to zero in order to obtain expressions of for both general and Gaussian ensemble distributions.

In the general case, Eq. (45) gives at optimality

so that Eq. (58) can be rewritten as

If the ensemble distribution is Gaussian, then Eq. (46) gives at optimality

so that Eq. (59) can be rewritten as

### d. Diagnostic of localization using sample correlations

For some applications, it might be interesting to compute localization diagnostics from estimations of the sample correlations only. Defining the element of sample correlation matrix as , we can split in

with

so that in the case where , it is possible to simplify Eq. (64) to get a first-order approximation of :

Since the sample correlation is bounded between −1 and 1, estimations of through spatial and angular averages are rather robust, even if is heterogeneous and anisotropic. However, the accuracy of the approximation cannot be analytically quantified. Indeed, it depends on the ensemble size *N* and on the higher moments of the distribution of . Nevertheless, this approximation should be better at small separations, since converges to 1, so that converges to 0.

Contrary to what can be usually found in the literature, localization functions diagnosed from all methods in this section are lower than 1 at zero separation. For instance, Eq. (67) gives

This is due to the fact that the optimal Schur filter is biased, so that the gain is always lower that 1, as shown in Eq. (27). It means that the diagnosed localizations also try to damp the sampling noise on variances by decreasing their value.

### e. Spatial and angular averages

To estimate the expectations in formulae (62), (64), and (67), it is necessary to make an ergodicity assumption. As for variance filtering, several assumptions could be possible, but we keep the simplest here: the spatial and angular ergodicity assumption, leading to homogeneous and isotropic localization functions. The spatial and angular averages are combined in a single operator , such as is a function of the separation *r* that gives the spatial and angular average of the matrix .

Denoting the matrix such as , formulae (62), (64), and (67) are thus respectively transformed to give localization functions:

and

## 8. Illustration in an idealized framework

The aim of this section is to provide a simple illustration of what has been demonstrated above, and an evaluation of possible suboptimal effects associated with the prespecification of a chosen analytical filter, in a 1D idealized framework for which the expectation over both random processes and is assumed to be computable. Applications to real meteorological fields from an NWP model are shown in Part II.

### a. Analytical framework

To get rid of resolution and domain size problems, we use a continuous 1D representation on an infinite domain. We define a simplified model of asymptotic covariance , split into heterogeneous variances and homogeneous correlation . We assume that asymptotic variances are generated from a random process that is scaled so that it has a unit expectation and homogeneous covariance , where *s* is the standard deviation of and is its correlation function. For sake of simplicity, the asymptotic correlation function is assumed to be exactly the same at every realization of its generating random process , with no variability. The spatial coordinate *x* is scaled so that the correlation length scale of is unit. Finally, we assume that the sample distribution is Gaussian, so that the covariance of sample covariances is expressible with the previous quantities.

### b. Variance filtering

In a continuous framework, the homogeneous linear filtering of variances is given by the convolution with a kernel *F*:

Therefore, Eq. (50) can be expressed in a continuous way:

Since the product of Gaussian functions is Gaussian too, and since the integral of a Gaussian function is known, we define *F* as a normalized Gaussian kernel of length scale , and we assume that and are also Gaussian kernels of respective length scales and :

and

Only *F* is normalized so that its integral is 1. The terms and are correlation functions, so their value at origin must be 1. Thus, Eq. (71) can be computed explicitly to get

Finally, we can find numerically the filtering length scale such as for every set of parameters.

In Fig. 2, the filter found from the criterion for a Gaussian filter *F* is compared to the theoretical filter given by Eq. (18a) and transposed in a continuous framework here:

where denotes covariance function of *υ*, which is homogeneous and therefore depends on one coordinate *x* only. While some differences are visible at the origin and at nonzero separation (due to the suboptimality of the chosen Gaussian filter), the two filters are relatively similar on the whole. Moreover, using the criterion with a Gaussian filter provides a strictly positive filter, which ensures that filtered variances are always positive.

For any homogeneous linear filter, the mean square filtering error is given by . One can demonstrate that if all functions are Gaussian, can also be computed explicitly:

The consistency of the asymptotic behavior of this expression can be verified:

If then : without filtering, we get the exact value of the sampling error in our framework.

If then : for a global average, the mean square error is equal to the homogeneous spatial variance of .

If then : the filtering is causing a nonzero error even if the sampling error vanishes, unless .

Equations (73) and (75) are useful to investigate the efficiency of a homogeneous filtering and the ability of our simplified criterion (53) to find the optimal filtering length scale.

In our framework, there are three relevant parameters to assess the filtering behavior:

The ensemble size

*N*is directly involved in the sampling noise amplitude, as shown in Eqs. (33) or (34). We expect a larger filtering length scale and a more effective filtering as*N*decreases, since the sampling noise increases.The scale of the variations of asymptotic variances is given by and compared to the scale of the sampling noise, given by (Raynaud et al. 2009). We expect a larger filtering length scale and a more effective filtering as increases, because asymptotic variances signal and sampling noise are then easier to separate.

The standard deviation of asymptotic variances relative to their mean value is given by

*s.*It is straightforward to show from Eqs. (33) or (34) that the sampling noise increases where the asymptotic variances are larger, so that the discrepancies in sampling noise amplitude increase with*s.*We expect a larger filtering length scale and a more effective filtering as*s*decreases for two reasons: the noise amplitude is more homogeneous and a stronger filtering can be applied (larger ) without significantly degrading the asymptotic variances signal.

The variations of the filtering length scale found from the criterion (73), as a function of the relevant parameters, is given in Fig. 3. As expected, we get a larger filtering length scale when the asymptotic variances are smoother (i.e., increases), whereas we get a smaller when both the standard deviation (*s*) and the ensemble size (*N*) increase. Thus, the criterion (53) provides a filter that adjusts its length scale coherently with the sample variances texture and uncertainty. Figure 4 confirms that the filtering is more beneficial, in terms of mean square error relative variations, for smoother variances (large ), with reduced standard deviation *s* and a smaller ensemble size *N*.

Regarding the optimality of the filtering length scale found from the criterion, Fig. 5 shows that this is almost the case, even if there is an overestimation of the filtering length scale for small ensemble sizes due to the suboptimality of the chosen Gaussian filter. However, for the use of variances in modeled background error covariances, this slight overfiltering would not be harmful, since in that case we prefer to lose a bit of signal on variances than to keep sampling noise on them.

### c. Covariance localization

In the case of a Gaussian sample distribution, the variance of the sample covariance error is deduced from Eq. (34), to give

In our simplified framework, the asymptotic correlation is again assumed to be the same at every realization of its generating random process , so that and

Thus, introducing this result in the optimal localization of Eq. (59) gives

The associated length scale can be derived from the classical formula of Daley (1991):

Noticing that the correlation function has the properties

and

it is possible to show that

so that the localization length scale of Eq. (79) can be expressed as

We would like to emphasize that this relationship between correlation and localization length scales should not be used to adjust a parameterized localization function (for instance, of Gaussian shape) from a diagnosed correlation length scale. Indeed, if the shapes of the prescribed localization function and the optimal localization function are different, then the most appropriate length scale for the prescribed function might be very different from the one given by Eq. (82). It should not be forgotten that length scales provide information about the curvature of functions at their origin, but no insight about their actual extension, which is decisive for a localization function.

The localization functions computed from Eq. (78) are illustrated in Fig. 6 for several ensemble sizes and asymptotic correlation shapes: the well-known Gaussian shape and the Lorentzian shape, defined as

As the ensemble size decreases, the localization function is smaller and has a reduced horizontal extension. This can be easily interpreted from the higher level of noise that spoils covariances estimated from smaller samples, especially at larger separation.

We can compare the localization given by Eq. (78) to the compact-supported function of Gaspari and Cohn (1999) that can be frequently found in the literature (Hamill et al. 2001; Lorenc 2003). To measure the efficiency of any localization , the mean square error of any localized covariances can be computed as

For this experiment, we used a Gaussian covariance for asymptotic variances with an amplitude and a length scale , and an ensemble size of 80. Figure 7 shows that the Gaspari and Cohn (1999) function, even with an optimized length scale that minimizes the average mean square error, is always outperformed by the localization of Eq. (78), since the difference of their mean square errors is always positive. We notice that, in our case, the optimal localization function always has a flatter top than the Gaspari and Cohn (1999) function, followed by steeper slopes and slighter tails. This is consistent with the mean square error difference maxima, which arises on each side of the steep slope. Thus, the Gaspari and Cohn (1999) function seems better adapted for situations characterized by a peaked rather than a more Gaussian-shaped correlation function.

In nonidealized cases, the practical estimation of the localization is not possible from Eq. (78), but only from Eqs. (69a), (69b), and (69c). Figure 8 shows that in our idealized framework, these formulations are almost equivalent. As expected, they provide a good approximation of the localization function for small separations, but have smaller tails. The noise that appears at large separation is not a real issue. Indeed, and as it is shown in Part II, a fit of these diagnosed localizations with positive-definite functions, in which the large-separation noise will be damped, is still required before any further use.

The shape of diagnosed localization functions at small separation can be quantified through the localization length scale . Figure 9 shows that the length scales of localization functions from Eqs. (69a), (69b), and (69c) give similar results and are very close to the theoretical one given by Eq. (82), especially for ensembles with a large number of members. As expected, the localization length scale increases with the ensemble size *N*.

## 9. Conclusions

Ensemble data assimilation methods are powerful techniques for estimating background error covariance, but associated error covariance estimates are inevitably affected by sampling noise, which should be taken into account and filtered out.

The core of this article is a demonstration that optimality criteria can be found for linear and Schur filtering of covariances, which involve sample estimated quantities and filter output only. No knowledge of the properties of the true signal is required—instead of usual optimal filters—which makes these criteria easy to estimate with simple assumptions (e.g., spatial ergodicity). To obtain these results, it is necessary to use elements from two distinct theories: the optimal linear filtering theory and the centered moments estimation theory. Only a few basic assumptions are required: 1) the sampling noise on covariances should be unbiased and not correlated with the asymptotic covariances, and 2) the two involved random processes allowing the successive generation of asymptotic and sample centered moments are independent.

Once the general criteria are obtained, simplifications can be considered for specific applications. In this paper, we have dealt with the linear filtering of variances and with the localization of covariances through a Schur product, but other applications could be designed from the general criteria [e.g., filtering of Local Correlation Hessian tensor from Weaver and Mirouze (2013)]. For the linear filtering of variances, it has been shown that if a homogeneous and isotropic filtering with a well-suited one-parameter-dependent function is chosen, then this parameter (usually the filtering length scale) can be efficiently optimized iteratively using a homogeneous subcriterion with a fast convergence. An attractive advantage of this method is the insurance of positiveness for the filtered variances. Concerning covariance localization, the exact formula can be either computed using sample variances and covariances or approximated using sample correlations. The theoretical efficiency of these new methods has been detailed and illustrated in an idealized framework, showing expected behaviors. Their application in a full convective-scale NWP model is detailed in Part II.

## Acknowledgments

This work has been supported by the École Normale Supérieure of Paris.

### APPENDIX A

#### Random Variables with Random Moments

Let be a random vector, whose moments are known up to order two. Its expectation and its covariances are respectively given by

and

Let be a square root of , from which we can express without loss of generality as

where is a random unbiased vector of identity covariance matrix:

and

Higher-order moments of are determined from those of and from .

In a second step, we now assume that and are also random variables—not necessarily independent, but independent from

and

From these assumptions, we can check that

and that

As a conclusion, when the expectation and the covariance of two random variables become random variables themselves, the new expectation is given by the expectation of the initial expectation [Eq. (A5)] and the new covariance is given by the covariance of the initial expectation , added to the expectation of the initial covariance [Eq. (A6)].

### APPENDIX B

#### Criterion of Optimal Linear Filtering of Covariances

and

##### a. Direct property

The sample covariances error can be rewritten as

At filter optimality, thanks to the orthogonality of the filtering error with the sample estimation [Eq. (22)], several terms vanish in the covariances of to get

If we take Eq. (33) and replace

we finally get

which is equivalent to

##### b. Inverse property

To show the inverse property, we assume that the Eq. (B7) is verified, and we show that the linear filter is necessarily optimal. Assumptions (20a) and (20b) imply that

and also that

Equation (B10) can be rewritten for permutations of indices , which builds a linear system of 24 equations with 24 unknowns: the couples and their indices permutations. The determinant of this system can be computed explicitly with any formal computation software: it is a fraction of two polynomials in *N* of order 36 and 12. This determinant vanishes only for . If the sample has more than three members, then the linear system is invertible, so

and they are all the same for all indices permutations.

### APPENDIX C

#### Homogeneous and Isotropic Variance Filtering

##### a. Properties in spectral space

Sample and filtered variances and counterparts in spectral space are respectively denoted and . The Plancherel formula gives the spatial average of a Schur product of two gridpoint vectors as a spectral average of a Schur product of their spectral counterparts:

where *K* is the maximum wavenumber index, and the overline denotes the conjugate complex number. From Berre (2000), we know that a homogeneous and isotropic filtering in gridpoint space is equivalent to a diagonal filter in spectral space:

where are real coefficients of the spectral filter, lying between 0 and 1, with to retain the spatial average. With usual filtering function shapes, we can check that for ,

since for increasing filtering intensity (increasing ), the coefficients go to 0.

For instance, in a continuous case, the Fourier transform of a Gaussian filter of length scale is given by

so that

##### b. Unicity demonstration

In the general case, the spectral transform of is denoted . Equation (52) can thus be expressed in spectral space by

Its derivative with respect to one of the coefficients is given for by

so that

which proves that is strictly increasing.

Moreover, if there is no filtering , so that

and

and from the convexity property,

hence,

so finally,

and

As a conclusion, the optimality condition is obtained for a unique value of the filtering length scale .

## REFERENCES

*Proc. ECMWF Workshop on Flow-Dependent Aspects of Data Assimilation,*Reading, United Kingdom, ECMWF,

*The Analysis of Time Series: An Introduction.*Chapman and Hall/CRC, 333 pp.

*Atmospheric Data Analysis.*Cambridge University Press, 460 pp.

*ECMWF Seminar on Recent Developments in Data Assimilation for Atmosphere and Ocean,*Reading, United Kingdom, ECMWF, 45–63.

*Aspects of Multivariate Statistical Theory.*John Wiley & Sons, 673 pp.

*J. Reine Angew. Math.,*

**140,**1–28.

*Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications.*MIT Press, 163 pp.