## Abstract

This study proposes a variational approach to adaptively determine the optimum radius of influence for ensemble covariance localization when uncorrelated observations are assimilated sequentially. Covariance localization is commonly used by ensemble Kalman filters to limit the impact of covariance sampling errors when the ensemble size is small relative to the dimension of the state. The probabilistic approach is premised on finding an optimum localization radius that minimizes the distance between the Kalman update using the localized sampling covariance and that using the true covariance, when the sequential ensemble Kalman square root filter method is used. The authors first examine the effectiveness of the proposed method for cases when the true covariance is known or can be approximated by a sufficiently large ensemble. Not surprisingly, it is found that the shorter the correlation distance of the true covariance or the smaller the ensemble, the smaller the localization radius that is needed. The authors further generalize the method to the more usual scenario in which the true covariance is unknown but can be represented or estimated probabilistically from the ensemble sampling covariance. The mathematical formulation of this probabilistic and adaptive approach with the use of the Jeffreys prior is derived. Promising results and limitations of this new method are discussed through experiments using the Lorenz-96 system.

## 1. Introduction

The Kalman filter was first proposed by Kalman (1960) under the assumptions of linear dynamics, knowledge of certain covariance information, and Gaussian random variables. Evensen (1994) first introduced the ensemble Kalman filter (EnKF) into this context. The EnKF uses a short-term ensemble to estimate the flow-dependent background error covariance, at a computational cost far more affordable than that of the original Kalman filter for high-dimensional problems. As variations of the EnKF, several ensemble square root filters (EnSRF) have been developed [for a summary see Tippett et al. (2003)] in order to maintain consistency between the ensemble and the covariance (see Houtekamer and Mitchell 2001; Anderson 2001; Bishop et al. 2001; Whitaker and Hamill 2002; Snyder and Zhang 2003). Ott et al. (2004) designed the local ensemble transform Kalman filter so that the state variables can be updated completely in parallel. Meanwhile, efforts continue to be devoted to combining the EnKF with three- and four-dimensional variational data assimilation (3D/4D-Var; e.g., Hamill and Snyder 2000; Lorenc 2003; Wang et al. 2008a,b; Zhang et al. 2009, 2013). Sampling error is a problem common to all of these ensemble Kalman filters.

Since the ensemble size is usually far smaller than the state dimension in geoscience applications, the sample covariance calculated from the ensemble members cannot capture all the main properties of the true covariance. Some of the sampling errors can be eliminated when the physical correlation radius is small compared to the spatial dimension of the state, which is usually the case in numerical weather prediction. Houtekamer and Mitchell (2001) and Hamill et al. (2001) used localization to remove spurious correlations between distant variables. The localization method can be used in all ensemble-based filters and has been shown to be a powerful tool for limiting sampling error due to a small ensemble size. The Gaspari–Cohn (GC) localization function (Gaspari and Cohn 1999) is symmetric positive definite and widely used across a broad range of applications. Alternative methods include that of Jun et al. (2011), who considered estimating the background covariance by smoothing the sample covariance with a kernel function. The GC function is effective for spatial localization, but in most applications until recently the localization radius has been chosen empirically. Anderson (2012) shed new light on this problem by making use of a probabilistic approach. Anderson and Lei (2013) further extended the probabilistic approach to determine the localization radius based on observation impact. Lei and Anderson (2014) chose the localization radius to be the one that minimizes the difference between the sample covariance and the localized sample covariance computed from different samples. Meanwhile, Bishop and Hodyss (2007) and Bishop et al. (2011) proposed the use of entry-by-entry powers of the sample correlation together with a nonadaptive empirical localization function. The method proposed in the current study is based on a sequential ensemble square root filter. For covariance inflation we utilize the relaxation method from Zhang et al. (2004).
To clarify, we present the algorithm in detail in Table 1. Moreover, we note that the method discussed in this manuscript only applies to the case when observations are uncorrelated. Further generalization of this approach to correlated observation errors will be the subject of future research.

The structure of this paper is as follows. In section 2 we propose a variational approach through minimizing a cost function to find the optimum localization radius in the case that the true covariance is known. In section 3 we generalize the method to the case where the true covariance is unknown but can be represented probabilistically from the sample covariance. Numerical results about the performance of this method are presented through sections 2, 3, and 4. In particular, section 4 contains the numerical results based on the Lorenz-96 system. Section 5 provides a summary and discussion on the proposed localization method.

## 2. Optimal localization with true covariance: A cost function approach

As mentioned earlier, when the physical correlation radius is small, the sampling noise at far distances can be largely reduced by localization. Nevertheless, the benefits of this process come at the expense of removing small but nonnegligible true covariance beyond the cutoff distance and lowering the true covariance within the localization radius. Figure 1 is a visual illustration of the effect of covariance localization. In this experiment, we use the GC correlation function as the true covariance **P**^{true} by setting the true radius of influence (ROI) equal to 5. We draw 61 sample vectors from the Gaussian distribution 𝒩(0, **P**^{true}) and compute the sample covariance. Using a localization radius of 24, the localized sampling covariance is then compared with the raw sampling covariance and the true covariance. We can clearly see that localization effectively reduces the spurious correlation at far distances.
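
The localization experiment described above can be reproduced in outline with a short script. The following is a minimal sketch (our own code with hypothetical names, not the authors' implementation), assuming a 1-D periodic grid of 120 points and treating each radius as the GC half-width, so the taper reaches zero at twice that distance:

```python
import numpy as np

def gaspari_cohn(d, c):
    """Fifth-order piecewise rational Gaspari-Cohn taper.
    d: distances; c: half-width (the taper reaches zero at 2c)."""
    r = np.abs(np.asarray(d, dtype=float)) / c
    taper = np.zeros_like(r)
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r < 2.0)
    taper[m1] = (-0.25 * r[m1]**5 + 0.5 * r[m1]**4 + 0.625 * r[m1]**3
                 - (5.0 / 3.0) * r[m1]**2 + 1.0)
    taper[m2] = (r[m2]**5 / 12.0 - 0.5 * r[m2]**4 + 0.625 * r[m2]**3
                 + (5.0 / 3.0) * r[m2]**2 - 5.0 * r[m2] + 4.0
                 - 2.0 / (3.0 * r[m2]))
    return taper

n = 120                                    # grid size, as in the Lorenz-96 setup
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
D = np.minimum(gap, n - gap)               # periodic grid-point distances

P_true = gaspari_cohn(D, 5.0)              # true covariance of GC form, ROI = 5
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(n), P_true, size=61)   # 61 sample vectors
P_s = np.cov(X, rowvar=False)              # raw sample covariance
P_loc = P_s * gaspari_cohn(D, 24.0)        # localization with radius 24
```

Entries of `P_loc` beyond twice the localization radius are exactly zero, while `P_s` retains spurious nonzero correlations there.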

### Cost function F_{0}

To gain further understanding of this problem we first examine the solution in the case where the true covariance is known, while acknowledging that this is impossible in practice. In later sections of this paper we show how to remove this restriction and apply the technique to real problems where the true covariance or Kalman gain is not known. For this scenario we construct a way of describing how “good” the localization radius is for a given ensemble by associating it with the value of a cost function *F*_{0}. Recall that when one observation **y** is available, the updating formula for the state variable **x**_{i} using the algorithm in section 1 is as follows:

$$\Delta\bar{x}_i = \rho_i\,\beta_i^{s}\,\Delta \mathbf{y},$$

where $\Delta\bar{x}_i$ is the mean update of the *i*th state variable, Δ**y** is the innovation, $\beta_i^{s}$ is the regression coefficient calculated from the sample, and *ρ*_{i} is the value of the localization function for the *i*th state variable and observation **y**.
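
As a toy illustration of this update, the sketch below (our own hypothetical names; localization weights set to one for simplicity) computes the sample regression coefficient for a single scalar observation of state variable *k* with error variance *r* and applies the serial mean update:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, r = 10, 21, 0.04              # state size, ensemble size, obs error variance
k = 3                               # index of the observed state variable
ens = rng.standard_normal((n, N))   # forecast ensemble
xbar = ens.mean(axis=1)             # forecast mean
pert = ens - xbar[:, None]          # forecast perturbations
hx = pert[k]                        # observation-space perturbations

# sample regression coefficient: cov(x_i, hx) / (var(hx) + r)
beta_s = (pert @ hx) / (N - 1) / (hx @ hx / (N - 1) + r)

rho = np.ones(n)                    # localization weights (all one in this toy)
y = 0.5                             # scalar observation
innovation = y - xbar[k]
xbar_a = xbar + rho * beta_s * innovation   # serial mean update
```

The observed component always moves toward the observation, since its regression coefficient equals its variance divided by variance plus observation error, a number between 0 and 1.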

If we know the true covariance **P**^{true}, we can simply substitute **P**^{true} for **P**^{s} without doing localization. Hence, the exact Kalman updating formula becomes

$$\Delta\bar{x}_i = \beta_i^{t}\,\Delta \mathbf{y},$$

where

$$\beta_i^{t} = \frac{(\mathbf{P}^{\mathrm{true}}\mathbf{h}^{T})_i}{\mathbf{h}\,\mathbf{P}^{\mathrm{true}}\,\mathbf{h}^{T} + r}$$

is the true regression coefficient (perfect Kalman gain). Here **h** is the observational operator and *r* is the observational error variance.

A natural way of comparing the difference between using the true covariance and the localized sample covariance is the following cost function:

$$F_0(\mathrm{ROI}) = \sum_{i} \left(\rho_i\,\beta_i^{s} - \beta_i^{t}\right)^2.$$

The optimal localization radius is then the ROI that minimizes the cost function *F*_{0}. Now let us consider the case when more than one observation is available at the same time. We can find the optimal ROI for each observation and then take the maximum likelihood estimate over these values as our overall optimal ROI.
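
The search for the ROI minimizing *F*_{0} for one observation can be sketched as follows. This is our own illustration under stated assumptions: a GC-shaped true covariance on a periodic grid, a scalar observation of grid point `k` with error variance `r_obs`, and the convention that ROI is the GC half-width:

```python
import numpy as np

def gc_taper(d, c):
    """Gaspari-Cohn taper with half-width c (zero beyond 2c)."""
    r = np.abs(np.asarray(d, dtype=float)) / c
    t = np.zeros_like(r)
    m1, m2 = r <= 1.0, (r > 1.0) & (r < 2.0)
    t[m1] = (-0.25 * r[m1]**5 + 0.5 * r[m1]**4 + 0.625 * r[m1]**3
             - (5.0 / 3.0) * r[m1]**2 + 1.0)
    t[m2] = (r[m2]**5 / 12.0 - 0.5 * r[m2]**4 + 0.625 * r[m2]**3
             + (5.0 / 3.0) * r[m2]**2 - 5.0 * r[m2] + 4.0 - 2.0 / (3.0 * r[m2]))
    return t

n, N, r_obs, k = 60, 21, 0.04, 30
idx = np.arange(n)
d = np.minimum(np.abs(idx - k), n - np.abs(idx - k))   # distances to the obs
gap = np.abs(idx[:, None] - idx[None, :])
D = np.minimum(gap, n - gap)
P_true = gc_taper(D, 5.0)                              # known true covariance
rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(n), P_true, size=N)
P_s = np.cov(X, rowvar=False)

beta_t = P_true[:, k] / (P_true[k, k] + r_obs)         # true regression coeffs
beta_s = P_s[:, k] / (P_s[k, k] + r_obs)               # sample regression coeffs

candidates = np.arange(1.0, 31.0)                      # trial localization radii
F0 = np.array([np.sum((gc_taper(d, c) * beta_s - beta_t) ** 2)
               for c in candidates])
best_roi = candidates[np.argmin(F0)]                   # the ROI minimizing F_0
```

With more than one observation, one would repeat this per observation and take the maximum likelihood estimate over the per-observation minimizers.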

Based on the above analysis, we propose algorithm A0 (Table 2) for evaluating the optimal ROI (its results are shown in the left part of Table 3). In algorithm A0, $\beta_{ji}^{s}$ denotes the sample regression coefficient for the *j*th observation and *i*th grid point. Similarly, $\beta_{ji}^{t}$ is the regression coefficient computed using the true covariance for the *j*th observation and *i*th grid point; *d*_{ji} is the distance (in number of grid points) between the *j*th observation and the *i*th grid point. In the left half of Table 3 and the left panel of Fig. 2, we show the value of ROI^{(0)} in terms of grid points for different true covariances and ensemble sizes. Given a true covariance of GC form with correlation radius ROI = 2, 5, 10, or 20, we draw samples of size *N* = 11, 21, …, 121 from the distribution 𝒩(0, **P**^{true}) and compute the sample covariance **P**^{s}. We assume observations exist at every grid point and have an error variance of 0.04. Then we compute ROI^{(0)} using algorithm A0 for each pair of ROI and *N* values. We repeat this test 1000 times for each pair of parameters, estimate the probability density of ROI^{(0)}, and plot the curves of (rescaled) probability density in Fig. 2. In Table 3 (left half) we show the maximum likelihood estimate of ROI^{(0)} for each choice of ensemble size and radius of influence of the true covariance.

For all choices of parameters in this experiment, the resulting maximum likelihood estimate of ROI^{(0)} is always larger than the true ROI. This is desirable since our goal is to find a localization that reduces the spurious correlation at far distances while approximating the true correlation at near distances. Second, we can see that for a larger ensemble size (or ROI) the resulting ROI^{(0)} also gets larger. We also find that, as the true ROI increases for a fixed ensemble size, the increment of ROI^{(0)} is almost linearly proportional to the increment of ROI. That is likely due to the Gaussian assumption. More discussion of Table 3 and Fig. 2 is presented in section 3.

## 3. A probabilistic method of localization

In more common and realistic cases the true covariance is almost always unknown, so algorithm A0 is quite limited for real-data implementations. Anderson (2012) introduced a probabilistic method by assuming that the true correlation between a scalar variable and a scalar observation is a random variable based on the ensemble. In Lei and Anderson (2014), the localization value is chosen to minimize the difference between the sample regression coefficient and the localized sample regression coefficient from another pair of the same type of variables from another ensemble group of the hierarchical ensemble group filter. We follow an approach similar to Anderson (2012) in this study. More precisely, we compute the optimal localization radius whenever any observation is available for data assimilation. We assume that the true covariance between a scalar observation and all nearby state variables follows a specified error distribution, so we can compute the average value of some cost function over that distribution. Part of the cost function we will present has some resemblance to Eq. (1) in Lei and Anderson (2014). However, the main technique we employ in this manuscript is considerably different from that in Lei and Anderson (2014), which does not assume any prior distribution of the true covariance but instead compares the sample regression coefficient with the localized sample regression coefficient of the same type (under their definition) of pairs of observation variables and state variables from a different ensemble group. We use the Jeffreys prior and the Wishart distribution to calculate the conditional distribution of the true covariance matrix rather than using the sample covariance computed from another ensemble group member as in Lei and Anderson (2014).
It may also be worth noting that the method in Lei and Anderson (2014) finds a spatially homogeneous localization, while the scheme we propose in this manuscript can be applied at any location and is therefore spatially varying. In our proposed method of this section, $\beta_i^{t}$ is still the true regression coefficient, but it is now a function of the true covariance, which is represented by a random matrix. Therefore, the regression coefficient is a random variable. The distribution of the true covariance for a given sample comes directly from Bayes’s theorem:

$$p(\mathbf{P}^{\mathrm{true}} \mid \mathbf{P}^{s}) = \frac{p(\mathbf{P}^{s} \mid \mathbf{P}^{\mathrm{true}})\,p(\mathbf{P}^{\mathrm{true}})}{p(\mathbf{P}^{s})}.$$

In the above formula, *p*(**P**^{s}) is a constant for a given sample and *p*(**P**^{s} | **P**^{true}) is the well-known Wishart distribution: since the ensemble members are assumed to be drawn from a Gaussian distribution, it has been mathematically proved that the distribution of the sample covariance is exactly the Wishart distribution (see Muirhead 1982). The term *p*(**P**^{true}) is called the prior distribution, the choice of which is not unique. Here we use the Jeffreys prior, which is proportional to the square root of the determinant of the Fisher information. We do not claim that the Jeffreys prior is the most appropriate distribution, but it is commonly used in the statistics community (see Bernardo and Smith 1994).
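
The Wishart fact invoked here — that for *N* Gaussian draws the scatter matrix (*N* − 1)·**P**^{s} follows a Wishart distribution with *N* − 1 degrees of freedom and scale matrix **P**^{true}, whose mean is (*N* − 1)·**P**^{true} — can be checked empirically. Below is a small numpy-only Monte Carlo sketch of our own (not from the paper) verifying the Wishart mean:

```python
import numpy as np

rng = np.random.default_rng(3)
p, N, trials = 3, 50, 2000
P_true = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.5],
                   [0.2, 0.5, 1.0]])     # a small SPD "true" covariance

scatter_mean = np.zeros((p, p))
for _ in range(trials):
    X = rng.multivariate_normal(np.zeros(p), P_true, size=N)
    S = np.cov(X, rowvar=False)          # sample covariance P^s
    scatter_mean += (N - 1) * S          # scatter matrix (N - 1) * P^s
scatter_mean /= trials

# the Wishart mean is (N - 1) * P_true; err should be small
err = np.abs(scatter_mean - (N - 1) * P_true).max()
```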

The average of the cost function *F* over all realizations of the true (random) covariance is

$$\bar{F} = \int F(\mathbf{P}^{\mathrm{true}})\, p(\mathbf{P}^{\mathrm{true}} \mid \mathbf{P}^{s})\, d\mathbf{P}^{\mathrm{true}}.$$

One limitation is that the Wishart distribution is valid only when the sample size is larger than the dimension of the state variable, which cannot be satisfied in most applications. If the sample covariance is not of full rank, then the space of **P**^{s} is not an open subset of the space of all symmetric positive definite (SPD) matrices. In this case, the space of singular positive semidefinite matrices is a lower-dimensional surface within the full space of SPD matrices, and the probability distribution is supported on this surface. But the true covariance is usually of full rank and hence does not lie in this subspace. Though the analog of the Wishart distribution for the case that **P**^{s} is of deficient rank has been studied, we do not use it for the following two reasons:

1) the true covariance may not lie in this subspace and the subspace is not flat; and

2) mathematically it is harder to simplify the cost function to a computable form.

Therefore, we consider computing the truncated average of the cost function instead of the overall average. Specifically, we consider

$$\bar{F}_{\mathrm{Loc}} = \int F(\mathbf{P}_{\mathrm{Loc}})\, p(\mathbf{P}_{\mathrm{Loc}} \mid \mathbf{P}^{s}_{\mathrm{Loc}})\, d\mathbf{P}_{\mathrm{Loc}},$$

where **P**^{s}_{Loc} (or **P**_{Loc}) is the sample (or true) covariance of the state variables that are within distance ROI from the observation. ROI can only be chosen among values for which the number of state variables within distance ROI of an observation is strictly less than the ensemble size *N*. Intuitively, this truncated average only takes into account the influence of the observation on the state variables that lie within the “radius of influence.” This restriction also causes the resulting ROI to be extremely small if we directly apply this method to 3D GCMs. This aspect will be studied further in the future.
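
The admissibility constraint on ROI can be made concrete for a 1-D grid with unit spacing, where the number of state variables within distance ROI of an observation is 2·ROI + 1; the helper below (our own hypothetical name) returns the largest integer radius satisfying the constraint:

```python
def max_admissible_roi(N):
    """Largest integer ROI such that the n_Loc = 2*ROI + 1 state
    variables within the radius (1-D grid, unit spacing) stay
    strictly below the ensemble size N."""
    # require 2*ROI + 1 < N  =>  ROI <= (N - 2) // 2
    return (N - 2) // 2
```

For example, an ensemble of 61 members admits radii of up to 29 grid points, since 2·29 + 1 = 59 < 61 while 2·30 + 1 = 61 is not.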

### a. How to choose the cost function

If we use exactly the same cost function *F*_{0} for the truncated average as in section 2, where the true covariance is known, we find that the variational method often leads to a trivial solution (ROI = 0, not shown). We believe part of the reason is that the truncation ignores all information outside of the radius of influence. In other words, the localization radius obtained using the function *F*_{0} in Eqs. (5) and (6) might differ considerably from that obtained using a truncated version of *F*_{0}. To limit the loss of information, we consider a modified cost function:

The advantages of the function *F*_{11} are the following:

Combining the above properties, we can see that for a given true covariance, substituting *F*_{11} for *F*_{0} in algorithm A0 is mathematically equivalent. More importantly, in the case that the true covariance is unknown, the truncated average, when using the Jeffreys prior, can be simplified explicitly so that integration over thousands of variables can be avoided. However, because of the use of the noninformative Jeffreys prior as the probability distribution function of the true covariance, we find that an additional penalty term must be included in the cost function to avoid a trivial solution in which the resulting ROI is simply the largest possible radius of influence allowed by the algorithm. Therefore, we define

where the second term is a penalty term that can be interpreted as a sampling noise or stability term, accounting for the difference between the true covariance and the sampling covariance scaled by the localization coefficient.

Now we present the complete form of the cost function designed for the case when the true covariance is unknown:

where *ρ*_{i} = GC(ROI, *d*_{i}).

For the exact mathematical expressions and definitions, including that of *d*_{Loc}, please refer to the appendix. Note that the integral is taken over thousands of variables, so it cannot be computed numerically in a direct way. Mathematical simplifications to Eq. (11) can be made under certain assumptions; a computable form of this expression is derived in the appendix.

#### Theorem 3.1

Suppose there is a scalar observation available at this time and the observational error variance is *r*. The observational operator is **h** = (0, 0, …, 0, 1, 0, …, 0). Let *k* be the index of the state variable for which the observation is taken. Let *n*_{Loc} be the number of state variables that are within distance ROI from the observation. Let *ϵ*_{i} ∈ ℝ^{1×N} be the forecast perturbation of the *i*th component of the state variable.

Let

Let

Then

where the expectation operator denotes the expected value of *f*(*x*) when *x* follows the indicated distribution.

Based on the above theorem we propose algorithm A1 (Table 4) to adaptively determine the localization radius. The computational cost of this algorithm is *O*(*N*^{2}*m*).

### b. Comparison of *ROI*^{(0)} and *ROI*^{(1)} for a given true covariance

We present the experimental results in Table 3 and Fig. 2 for a comparison of the ROI obtained using algorithm A0 of section 2 [ROI^{(0)}] versus the new probabilistic algorithm A1 [ROI^{(1)}]. The experiment for ROI^{(0)} is the same as that presented in section 2, where the true covariance is known. In addition, we add columns 7–11, which are values of ROI^{(1)} computed by algorithm A1 for the same true covariance and sample covariance used in columns 2–6, though the information on the true covariance is completely withheld from algorithm A1. Again we run 1000 tests for each experiment and estimate the probability density of ROI^{(1)} for each experiment. We plot the (rescaled) probability density function of ROI^{(1)} in the right panels of Fig. 2 and list the maximum likelihood estimates (MLE) of ROI^{(1)} in the right panel of Table 3. Note that the last row of Table 3 is the maximum value of ROI^{(0)} and ROI^{(1)} allowed by algorithms A0 and A1, respectively. The upper bound for ROI^{(0)} can be modified arbitrarily, with a larger upper bound translating into a larger range of possible solutions; ROI_{max} = 130 is found to be a large enough search range for our application. The upper bound for ROI^{(1)} is not as flexible as in the ROI^{(0)} case. In this experiment and the experiments in the next section, the maximum of ROI^{(1)} is chosen to be the largest radius for which the number of state variables within the localization region remains strictly less than the ensemble size *N*. Therefore, when the ensemble size is small and ROI^{(1)} tends to be large, the ROI^{(1)} values of the 1000 independent tests simply concentrate at the maximum value allowed by the algorithm. This is why some of the black curves in the right panels of Fig. 2 have a peculiar shape when the ensemble size is small.

When the true ROI = 2, ROI^{(0)} varies from 2.5 to 5. For the algorithm that uses no knowledge of the true covariance, ROI^{(1)} changes from 2.2 to 17, a much larger range than that of ROI^{(0)}. For example, when *N* = 11, the MLE of ROI^{(0)} is about 1.3 times the true ROI; when *N* = 121, the MLE of ROI^{(0)} is about 2.6 times the true ROI. Note that there is no similar linear relationship between the MLE and the true ROI for ROI^{(1)} [as is found for ROI^{(0)}]. When *N* = 11 and ROI = 10 or 20, the probability density of ROI^{(1)} concentrates at a single point, so the graph of the (rescaled) probability density function is not visible in Fig. 2. Similarly, for *N* = 21 and 31, the probability density of ROI^{(1)} in Fig. 2 is concentrated very near the maximal possible ROI (i.e., ROI_{max}). This is because algorithm A1 does not allow ROI^{(1)} to exceed its upper bound, and the best radius of influence in the search range is very likely to be the upper bound ROI_{max}. When *N* = 121, the distribution of ROI^{(1)} is far away from the upper bound ROI_{max}, but the MLE of ROI^{(1)} does not depend linearly on the true ROI either.

At first glance, the differences between ROI^{(0)} and ROI^{(1)} are quite large over the set of parameters we examine. When *N* = 121, the MLE of ROI^{(1)} is larger than that of ROI^{(0)} when the true ROI = 2, 5, and 10 but smaller than that of ROI^{(0)} when the true ROI = 20. This result is not surprising because in algorithm A1 the true covariance is completely unknown, as in real-data applications. On the other hand, the values of ROI^{(1)} are close to ROI^{(0)} when *N* = 61. Moreover, we see that when the ensemble size is small, ROI^{(1)} is also small even though the true covariance has a larger ROI. We also see that for a larger ensemble size, both algorithms result in larger ROIs. Therefore, the algorithm adjusts the ROI according to the expected sampling error for a given data assimilation cycle.

Although *F*_{11} would give the same value if we slightly enlarged the region allowed by the Wishart distribution, the integral would differ because of the use of the noninformative Jeffreys prior. As a consequence, if we used only *F*_{11} as our cost function, the resulting ROI^{(1)} would simply be ROI_{max} in most situations. This is why the second term in *F*_{10} is required to make the solution nontrivial. On the other hand, the prior does not necessarily have to be the Jeffreys prior, and the penalty term can have other forms as well. In this sense, the method is still quite empirical when the true covariance is unknown. Future studies will explore the use of more informative prior probability distribution functions for optimizing the ensemble localization distance.

## 4. Application to the ensemble covariance generated by the Lorenz-96 model

### a. Experiments with the Lorenz-96 system and approximated true covariance

Now we design experiments to compare algorithms A0 and A1 in a Lorenz-96 system (Lorenz 2006). Here the system is configured to have 120 variables and 30 uniformly distributed observations, which lie on the model grid points. The external forcing *F* is set to 8, the time step *dt* is set to 0.05, and observations appear every two time steps. We use an error variance of 0.04 for all observations in these experiments. The observational operator **H** is simply the restriction operator, as required by the current version of the algorithm. To get an approximated true covariance in the Lorenz-96 system, we first run 6000 ensemble members for *T*_{1} time steps with EnKF data assimilation. We then run the ensemble members for *T*_{2} time steps without data assimilation, and use all 6000 ensemble members to obtain an approximation of the true covariance **P**^{true}. In the first *T*_{1} time steps, we use covariance relaxation with *α* = 0.5 (Zhang et al. 2004). Then, we draw *N* ensemble members from the 6000 members and compute the sample covariance **P**^{s}. Finally we compute the optimal ROI^{(0)} using algorithm A0 with the known (approximated) true covariance, and compare it with the optimal ROI^{(1)} computed by algorithm A1 with no knowledge of the true covariance. We choose *α* = 0.5 here because the model still has a chance to blow up when no localization is used, even when the ensemble size is 6000. We run 1000 tests for each set of parameters and estimate the probability density of the optimum ROI for both algorithms A0 and A1. The graphs of the probability density functions are shown in Figs. 3 and 4, and the maximum likelihood estimates of ROI^{(0)} and ROI^{(1)} are listed in Table 5.
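
The Lorenz-96 dynamics *dx*_{i}/*dt* = (*x*_{i+1} − *x*_{i−2})·*x*_{i−1} − *x*_{i} + *F* and the fourth-order Runge–Kutta stepping used in these experiments can be sketched as follows (a standard implementation with our own function names, using the paper's *n* = 120, *F* = 8, and *dt* = 0.05):

```python
import numpy as np

def l96_tendency(x, F=8.0):
    """Lorenz-96 right-hand side with cyclic boundary conditions."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.05, F=8.0):
    """One fourth-order Runge-Kutta step."""
    k1 = l96_tendency(x, F)
    k2 = l96_tendency(x + 0.5 * dt * k1, F)
    k3 = l96_tendency(x + 0.5 * dt * k2, F)
    k4 = l96_tendency(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

n = 120                       # number of state variables in this study
x = 8.0 * np.ones(n)
x[0] += 0.01                  # small perturbation to trigger chaos
for _ in range(500):          # spin up toward the attractor
    x = rk4_step(x)
```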

We do not expect algorithms A1 and A0 to give similar values in all cases, because algorithm A1 does not use the true covariance while algorithm A0 has complete information about it.

When *T*_{1} = 50 (*T*_{1} is the EnKF analysis duration) and *N* = 61, the values of ROI^{(0)} are 17.7, 18.6, and 10.6 for the different *T*_{2} (*T*_{2} is the length of the free ensemble forecast without assimilation). We see that ROI^{(1)} = 19.4, 19.5, and 11 for the different *T*_{2}, respectively, which are very close to the values of ROI^{(0)}. But for *T*_{1} = 500 (Fig. 4), the probability density function of ROI^{(0)} is quite peculiar for *T*_{2} = 1 and 10 (blue and red curves in the left panels) when the ensemble size is small. The maximum likelihood estimates of ROI^{(0)} are around 7.7, 2.5, and 8 for the different *T*_{2} regardless of whether *N* = 11, 21, or 31. Although it is not clear why *T*_{2} = 10 results in ROI^{(0)} ≈ 2.5, which might be due to the system dynamics, this shows that ROI^{(0)} changes only a little as the ensemble size varies among *N* = 11, 21, and 31. The estimated probability densities of ROI^{(0)} for these parameters have similar shapes, as can be seen in the first three panels on the left of Fig. 4. Now, if we compare the graphs in all of the left panels of Figs. 2, 3, and 4, we see that although the (approximate) true covariance is known in all of these cases, there is a difference in the shapes of the probability distributions of ROI^{(0)}. In the experiments for the left panels of Fig. 2, where the sample is drawn from an exactly Gaussian distribution, the probability density function curves are approximately symmetric. By comparison, for tests using small ensemble sizes (the left panels of Figs. 3 and 4), the curves are not symmetric at all. This is probably because the distributions of the state variables in the Lorenz-96 system are not close to Gaussian. For *N* = 21, *T*_{1} = 50, and *T*_{2} = 10 (red curve in Fig. 3, *N* = 21), the curve even has two peaks. In this case it would be hard to determine a good localization radius even if we knew the true covariance.
For *T*_{1} = 500 and *T*_{2} = 100 (the black curves in Fig. 4), the behavior of ROI^{(0)} becomes close to that for *T*_{1} = 50 and *T*_{2} = 100 for all ensemble sizes. This might be because, after running the model for 100 time steps without data assimilation, the distribution of the state variables becomes close to the “climatological distribution.” But comparing the results in this case with the numerical results in section 4b, we see that an ROI around 11 does not give the optimal RMSE for the long-term real implementation. However, the curves in the right panels of Figs. 3 and 4 are more symmetric. This might be because our probabilistic method still assumes a Gaussian distribution. We do see that in the cases *T*_{1} = 50 or 500 with *T*_{2} = 1 or 10, ROI^{(1)} results in a smaller RMSE according to the numerical results in section 4b, and hence is a better value for real-data implementation.

### b. Experiments that use algorithm A1 to do adaptive localization in Lorenz-96 system

Now we implement algorithm A1 in the Lorenz-96 system. As mentioned before, the Lorenz-96 system configured here has *n* = 120 state variables and is integrated using the fourth-order Runge–Kutta scheme. The observations are uniformly distributed and lie on some of the grid points. The number of observations *m* and the ensemble size *N* differ from case to case. To handle filter divergence, we use a relaxation coefficient *α* = 0.5 (Zhang et al. 2004); this inflation is necessary here to prevent the model from blowing up. The choice of *α* = 0.5 is not tuned: we also ran the same experiment for *α* = 0.25 and the results are similar. The choice of *α* can slightly influence the resulting ROI^{(1)}, but the change is negligible at least for this experiment, so we only show the results for *α* = 0.5. At every time step when observations are available, we first compute ROI^{(1)} using algorithm A1 and then use ROI^{(1)} as the localization radius before assimilating the observations.
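
The relaxation inflation of Zhang et al. (2004) blends the analysis perturbations back toward the forecast perturbations with weight *α*; a minimal sketch (our own names, with a synthetic example in which the analysis shrinks the spread) is:

```python
import numpy as np

def relax_to_prior(pert_f, pert_a, alpha=0.5):
    """Zhang et al. (2004) relaxation: blend analysis perturbations
    back toward the forecast perturbations with weight alpha."""
    return (1.0 - alpha) * pert_a + alpha * pert_f

rng = np.random.default_rng(4)
pert_f = rng.standard_normal((120, 61))   # forecast perturbations (n x N)
pert_a = 0.6 * pert_f                     # synthetic analysis: shrunk spread
pert_relaxed = relax_to_prior(pert_f, pert_a, alpha=0.5)
```

With *α* = 0 the analysis perturbations are kept unchanged, and with *α* = 1 the forecast perturbations are fully restored; intermediate values partially reinflate the analysis spread.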

#### 1) Sensitivity to ensemble size

#### 2) Comparison of RMSE for different observational densities

In this subsection, we fix the ensemble size at *N* = 61 and plot the spatial RMSE of the ensemble mean as a function of time for time steps between 1 and 50, 1000 and 1200, and 1200 and 1400. These values are plotted for *m* = 30, 60, and 120 in Figs. 6, 7, and 8, respectively, along with ROI^{(1)} as a function of time. While a larger ROI tends to result in smaller RMSEs, the solution can be unstable when the observations are sparse, as indicated by the missing red curve in Fig. 6. The optimal choice of ROI must lower the RMSE while maintaining a stable solution. In Table 6 we show the temporal mean of the RMSE from time step 1000 to time step 5000 for different numbers of observations and fixed/adaptive ROI. When *m* = 30, the temporal means of the RMSE for the green, blue, cyan, and black curves are 0.213, 0.2274, 0.4295, and 0.2296, respectively. We find that the ROI^{(1)} value becomes stable after a few steps of data assimilation and lies around 20.2, and the RMSEs of the black, blue, and green curves are indistinguishable. In this case the adaptive method gives nearly the best ROI. On the other hand, we find that the resulting ROI is smaller when the observation density increases. This is likely because assimilating denser observations may cause the ensemble correlation to be more narrowly supported, which then leads to a smaller localization radius from this algorithm. Our speculation is based on Zhang et al. (2006, 729–730) and Daley and Menard (1993, see their Fig. 2): as more observations are assimilated, larger-scale errors are reduced more effectively first, an effect often called the “whitening” of the analysis error spectrum. In the meantime, it is known that with denser observations a larger localization radius can be applied to improve the mean update while still maintaining the stability of the solution. When *m* = 60, the temporal means of the red, green, blue, cyan, and black curves are 0.1038, 0.1061, 0.1138, 0.1381, and 0.1112, respectively.
The temporal mean of ROI^{(1)} in this case is 18.2209. When *m* = 120, the temporal means of the red, green, blue, cyan, and black curves are 0.0652, 0.0689, 0.0718, 0.0854, and 0.0713, while the temporal mean of ROI^{(1)} is 16.5928. In Fig. 8, we see that the blue and black curves are nearly identical, because the temporal mean of ROI^{(1)} is approximately 16. From the graph we can easily see that the red curve is consistently better than the black curve, implying that there is still room to reduce the RMSE by enlarging the ROI. This suggests that more work needs to be done to take observation density into account so that the localization radius determined by this algorithm can be more effective. On the other hand, comparing the graphs in the top panels, we see that with more observations the RMSE decreases more quickly, and it is hard to distinguish the curves for the first 50 time steps in Figs. 7 and 8.

#### 3) Curve of *F* value

The value of the cost function *F* for each observation in the *m* = 30 case is plotted in the top panels of Fig. 9 as a function of ROI for time steps 2, 120, and 1020; the ensemble size is *N* = 61. The red dots indicate where *F* attains its minimum for each observation. In the bottom panels, the curves are the (rescaled) estimated probability density over all possible ROIs. The blue dot shows where the maximum likelihood is attained, which is therefore the final ROI^{(1)} used for localization at each time step. The curves in the top panels are almost convex, so finding the minimum value (red dots) of *F* for each observation is not a difficult problem; this convexity is also a consequence of adding the penalty term to the cost function. The bottom panels show that the optimal ROI^{(1)} for each observation is distributed such that the probability density function of ROI^{(1)} has only one peak (i.e., the maximum likelihood estimate is clearly well defined). If the density function had two peaks, it would not be easy to determine a unique localization radius; in that case, one can try dividing the observations into groups so that for each group the probability density function of ROI^{(1)} is better behaved. This aspect of the algorithm needs further exploration in the future.

## 5. Discussion and summary

In this article we first presented a cost function approach to analyze the sampling error issue in the case where the true covariance is known. We then generalized this method to the case where the true covariance is unknown, presenting a probabilistic approach to determine the localization radius adaptively when serial ensemble square root filters are used to assimilate uncorrelated observations. The advantage of this method is that it uses only the information from the ensemble members and observations within a single assimilation window. Furthermore, the computational cost of the algorithm is small, and its performance in the Lorenz-96 system is promising. We compared the results from this probabilistic method with those from the deterministic method in the case where the true covariance is known; the results show that the probabilistic method gives a more useful radius of influence for long-time implementation than the deterministic method does. As a side result, we find that in order to determine a good localization radius, one needs to consider more than sampling error alone. In particular, the dynamical properties of the model need to be taken into account.

There are several issues worth further discussion and investigation. First, the algorithm under discussion does not use any information about the observation density; as a consequence, ROI^{(1)} is not the optimal radius of influence when observations are dense. Second, the algorithm in this article is based on serial ensemble square root Kalman filters and requires that the observation operator *H* be a restriction operator and that the observations be uncorrelated; these restrictions should be removed in future work. Third, the output ROI^{(0)} in the Lorenz-96 experiments, where the true covariance is approximated by 6000 ensemble members, is clearly not optimal according to the curves in Fig. 8, suggesting again that in order to obtain a good localization radius, system dynamics beyond sampling error need to be considered. Finally, the choice of a noninformative prior and the use of an empirical penalty term as presented in this article are not necessarily optimal. Deriving a useful mathematical formula for other choices of prior is a challenging problem.

There are several ways to modify the algorithm in order to make use of more statistical tools. For example, we can generalize the concept of maximum likelihood to the "peaks of the density function." More precisely, if the probability density function of ROI^{(1)} has more than one peak, it may be wiser to use the different peak values as the ROI for different observations rather than using only the peak value at the maximum likelihood estimate. One can also try combining this scheme with other methods; for example, one can take observational impact into account to obtain a weighted maximum likelihood estimate. Future research is also warranted to extend the current one-dimensional study to 2D or 3D, as well as to covariances across different physical variables (e.g., between temperature and winds).
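One hedged sketch of this multi-peak idea, assuming the per-observation optimal radii are already available: locate the local maxima of the estimated density and let each observation use the peak nearest to its own optimum. All names here are illustrative, not part of the algorithm as published.

```python
import numpy as np
from scipy.stats import gaussian_kde

def roi_peaks(per_obs_roi, grid):
    """Find the peaks of the estimated ROI density, then assign each
    observation the peak closest to its own optimal radius.

    per_obs_roi: 1-D array of per-observation optimal ROIs.
    grid: 1-D array of candidate radii on which to evaluate the density.
    Returns (peaks, assignment), where assignment[k] is the ROI used
    for observation k.
    """
    kde = gaussian_kde(per_obs_roi)
    d = kde(grid)
    # Interior local maxima of the smooth density curve.
    peaks = grid[1:-1][(d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])]
    if peaks.size == 0:  # fall back to the global mode
        peaks = np.array([grid[np.argmax(d)]])
    # Each observation uses the peak nearest its own optimum.
    assign = peaks[np.argmin(np.abs(per_obs_roi[:, None] - peaks[None, :]),
                             axis=1)]
    return peaks, assign
```

When the density is unimodal this reduces to the single maximum likelihood estimate used in the article; with a bimodal density it splits the observations into two groups, one per peak.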

## Acknowledgments

We thank Zhibiao Zhao for his valuable suggestions on the choice of the Jeffreys prior; Ningtao Wang and Cheng You for their valuable suggestions about the Wishart distribution; Jeffrey Anderson for encouraging us to publish this result; and John Harlim, Jonathan Poterjoy, and the anonymous reviewers for their patient and careful reading and generous suggestions on how to make the mathematics clearer. This work was supported in part by Office of Naval Research Grant N000140910526 and National Science Foundation Grant ATM-0840651.

### APPENDIX

#### Mathematical Derivation

##### a. Notation

For convenience, the notation in this appendix differs slightly from that in the main text. For example, ^{s} in the appendix is not the sample covariance, but ; is actually _{Loc}; the true regression coefficient is denoted not by , but by *r*_{i}; and *n* is short for *n*_{Loc}.

##### b. Review of setup

We now consider how to determine the localization radius when assimilating a single observation. Since we assimilate only one observation at a time, for notational convenience we always assume the observation is at the first grid point. The observation operator **H** is then a row vector; specifically, **H** = (1, 0, …, 0).

Let $\hat{r}_i$ be the regression coefficient for the *i*th variable when using the pure sample covariance matrix, and let *r*_{i} be the regression coefficient for the *i*th variable when using the true covariance matrix. Then the regression coefficient for the *i*th variable when using the localized sample covariance matrix is $\rho_i \hat{r}_i$, where *ρ*_{i} = GC(ROI, *d*_{i}).
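For concreteness, a minimal sketch of these localized regression coefficients in the scalar-observation setting above, using the fifth-order Gaspari–Cohn correlation function for GC. The half-width convention (support out to twice the parameter `c`), the function names, and the gain formula below are our assumptions for illustration, not quoted from the article.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Fifth-order piecewise-rational Gaspari-Cohn correlation function.

    Equals 1 at distance 0 and vanishes at distances >= 2c.
    """
    r = np.abs(np.asarray(dist, dtype=float)) / c
    out = np.zeros_like(r)
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r < 2.0)
    out[m1] = (-0.25 * r[m1] ** 5 + 0.5 * r[m1] ** 4 + 0.625 * r[m1] ** 3
               - (5.0 / 3.0) * r[m1] ** 2 + 1.0)
    out[m2] = ((1.0 / 12.0) * r[m2] ** 5 - 0.5 * r[m2] ** 4
               + 0.625 * r[m2] ** 3 + (5.0 / 3.0) * r[m2] ** 2
               - 5.0 * r[m2] + 4.0 - (2.0 / 3.0) / r[m2])
    return out

def localized_regression(X, R, roi, dists):
    """Localized sample regression coefficients rho_i * rhat_i.

    X: (n, N) matrix of ensemble perturbations; the observation acts on
    the first state variable (H = (1, 0, ..., 0)) with error variance R.
    dists: distance of each state variable from the observed grid point.
    """
    N = X.shape[1]
    cov_with_obs = X @ X[0] / (N - 1)            # sample cov(x_i, Hx)
    rhat = cov_with_obs / (cov_with_obs[0] + R)  # sample regression coeffs
    rho = gaspari_cohn(dists, roi)               # localization weights
    return rho * rhat
```

At the observed point the weight is 1, so the coefficient reduces to the usual scalar Kalman gain; beyond twice the half-width the coefficients are exactly zero, which is what makes the serial update local.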

Since we do not know the true covariance, we treat it as a random matrix. Let ^{s} be *N* − 1 times the sample covariance matrix, and let *p*( | ^{s}) be the probability density of the true covariance matrix given ^{s}. Then

Let *R* be the observational covariance and **H** = (1, 0, …, 0).

Then

Let *n* = *n*_{Loc} be the number of state variables within distance ROI to the observation and *N* be the ensemble size. We require *N* > *n*_{Loc}.

We define functions *F*_{1} and *F*_{2}:

and

where

is a constant depending only on the sample and *n* such that

We want to minimize the following:

In this article we use the Jeffreys prior.
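For reference, the standard Jeffreys (noninformative) prior for the covariance matrix of an *n*-variate Gaussian takes the following form; we assume this is the form intended here, writing the unknown covariance generically as $\mathbf{B}$:

```latex
p(\mathbf{B}) \;\propto\; \det(\mathbf{B})^{-(n+1)/2}
```

This prior is invariant under reparameterization, which is why it is a natural noninformative choice when the true covariance is treated as a random matrix.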

The notations are

##### c. Mathematical derivation

###### 1) First we compute :

Let ^{s} = (^{s})^{−1} and = ^{−1}

[By Muirhead (1982), theorem 2.1.8, *d*^{−1} = det()^{−(*n*+1)}*d*.]

[By Muirhead (1982) theorem 2.1.11.]

Hence,

Lemma 1: .

###### 2) Compute

Let

where _{n−1} is the (*n* − 1) × (*n* − 1) identity matrix, 0 is the zero row vector, and **t** = (*t*_{2}, …, *t*_{n})^{T} is a column vector whose entries are defined by *t*_{i} = −*b*_{1i}/*b*_{11} for 2 ≤ *i* ≤ *n*. We define *d* = *dt*_{1} ⋯ *dt*_{n} and *d***t** = *dt*_{2} ⋯ *dt*_{n}. Let be an (*n* − 1) × (*n* − 1) matrix whose entries are defined by = *b*_{i+1,j+1} − *b*_{1,i+1}*b*_{1,j+1}/*b*_{11} (the Schur complement of *b*_{11}). Then it can be straightforwardly verified that
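As a quick numerical sanity check of the claimed decomposition, a sketch assuming *t*_{i} = −*b*_{1i}/*b*_{11} and the Schur-complement entries above (symbol names are ours): the congruence transform by the unit lower-triangular matrix built from **t** should block-diagonalize any SPD matrix into its (1,1) entry and the Schur complement.

```python
import numpy as np

# Build a generic symmetric positive-definite matrix B.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = A @ A.T + 5.0 * np.eye(5)
n = B.shape[0]

# T has 1 in the (1,1) entry, the identity in the lower-right block,
# and t_i = -b_{1i}/b_{11} in the first column below the diagonal.
T = np.eye(n)
T[1:, 0] = -B[1:, 0] / B[0, 0]

# Schur complement of b_{11}: entries b_{i+1,j+1} - b_{1,i+1} b_{1,j+1} / b_{11}.
S = B[1:, 1:] - np.outer(B[0, 1:], B[0, 1:]) / B[0, 0]

# The congruence transform should produce diag(b_{11}, S).
M = T @ B @ T.T
```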

and we have

Lemma 2: .

Because of space limitations we merely remark that this can be proven by careful but straightforward computation, leaving to the interested reader the task of filling in the details.

It is easy to see that . So

It is known that for an *n* × *n* SPD matrix , *d*^{−1} = det()^{−(n+1)}*d* [see Muirhead (1982), theorem 2.1.8].

Therefore and

Now define . So

Let the ensemble perturbations be **ϵ**_{i} and define **δ**_{i} = **ϵ**_{i} + *t*_{i}**ϵ**_{1}. Then by direct computation one can get

Then

Let . Then

It follows from Muirhead (1982) theorem 2.1.11 that

Hence,

Let

and , , , for 2 ≤ *i* and *j* ≤ *n*.

Then

and *d* = *dt*_{1}*d***t**.

Hence, it suffices to compute the following:

Lemma 3: For 2*l* > *n* + 2 and even integer *n*,

Lemma 4: For 2*l* > *n* + 2, .

Lemma 5: Let **I** ∈ ℝ^{n×n} denote the identity matrix and **s** ∈ ℝ^{n} be a column vector. Then det(**I** + **ss**^{T}) = 1 + |**s**|^{2}.
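Lemma 5 is the rank-one special case of the matrix determinant lemma; a quick numerical check (our own illustrative setup, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(6)  # an arbitrary column vector in R^6

# det(I + s s^T) should equal 1 + |s|^2 exactly.
lhs = np.linalg.det(np.eye(6) + np.outer(s, s))
rhs = 1.0 + s @ s
```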

Lemma 6:

where is the singular value decomposition, and (^{T})_{i} is the *i*th row of the matrix ^{T}.

Proof: Recall that and :

Let as before, then and :

where *a*_{3} is essentially a Gamma function. One can easily get the desired expression of *a*_{3} by substitution.

Let

then

Recall the following well-known identities for multivariate gamma functions (assuming both are odd numbers):

It is not hard to check that

Hence (combining all the above identities), we have

Lemma 7:

###### 3) Compute :

By similar derivation we can get

Hence, it suffices to compute

Lemma 8:

It is easy to see that by symmetry when *i* ≠ *j*

On the other hand,

Hence, one can derive

Lemma 9: For odd numbers *n*, 2*k*:

And similarly, we can derive

Lemma 10:

Combining all the results above, we have the following theorem:

Theorem 5.1:

where Δ_{ii} is the *i*th diagonal element of **Δ**.

It is not hard to see that

## REFERENCES

*Bayesian Theory.* John Wiley and Sons, 608 pp.

Kalman, R. E., 1960: A new approach to linear filtering and prediction problems. *Trans. ASME J. Fluids Eng.,* **82D,** 35–45.

*Predictability of Weather and Climate,* T. N. Palmer and R. Hagedorn, Eds., Cambridge University Press, 40–58.

Muirhead, R. J., 1982: *Aspects of Multivariate Statistical Theory.* John Wiley and Sons, 673 pp.

*Tellus,*

## Footnotes

This article is included in the Sixth WMO Data Assimilation Symposium Special Collection.