## 1. Introduction

The Kalman filter was first proposed by Kalman (1960) under the assumption of linear dynamics, knowledge of certain covariance information, and Gaussian random variables. Evensen (1994) first introduced the ensemble Kalman filter (EnKF) into this context. EnKF takes advantage of a short-term ensemble to estimate flow-dependent background error covariance, the computational cost of which is much more affordable than the original Kalman filter for large dimensional computation. As variations of EnKF, several ensemble square root filters (EnSRF) have been developed [for a summary see Tippett et al. (2003)] in order to maintain consistency between the ensemble and the covariance (see Houtekamer and Mitchell 2001; Anderson 2001; Bishop et al. 2001; Whitaker and Hamill 2002; Snyder and Zhang 2003). Ott et al. (2004) designed the local ensemble transform Kalman filter so the state variables can be updated completely in parallel. Meanwhile people are devoting efforts to combining EnKF with three- and four-dimensional variational data assimilation (3D/4D-Var; e.g., Hamill and Snyder 2000; Lorenc 2003; Wang et al. 2008a,b; Zhang et al. 2009, 2013). In the ensemble Kalman filters, sampling error is one of the common problems.

Since the ensemble size is usually far less than the state dimension in geoscience applications, it is impossible for the sample covariance calculated by the ensemble members to capture all the main properties of the true covariance. Some of the sampling errors can be eliminated when the physical correlation radius is small compared to the spatial dimension of the state, which is usually the case in numerical weather prediction. Houtekamer and Mitchell (2001) and Hamill et al. (2001) used localization to remove spurious correlations between distant variables. The localization method can be used in all ensemble-based filters and it has been shown to be a powerful tool for limiting sampling error due to a small ensemble size. The Gaspari–Cohn (GC) localization function (Gaspari and Cohn 1999) is symmetric positive definite and widely used for a wide range of applications. Alternative methods include Jun et al. (2011) who considered estimating the background covariance by smoothing the sample covariance using a kernel function. The GC function is effective for spatial localization, but the localization radius in most applications until recently is chosen empirically. Anderson (2012) sheds some new light on this problem by making use of a probabilistic approach. Anderson and Lei (2013) further extended the probabilistic approach to determining the localization radius based on observation impact. Lei and Anderson (2014) chose the localization radius to be the one that minimizes the difference between sample covariance and localized sample covariance computed from different samples. Meanwhile Bishop and Hodyss (2007) and Bishop et al. (2011) proposed the use of entry by entry powers of sample correlation together with a nonadaptive empirical localization function. The method proposed in the current study is based on a sequential ensemble square root filter. For covariance inflation we utilize the relaxation method from Zhang et al. (2004). To clarify, we present the algorithm in detail in Table 1. Moreover we would like to mention that the method under discussion in this manuscript only applies to the case when observations are uncorrelated. Further generalizations of this approach to correlated observation errors will be the subject of future research.

Algorithm of sequential localized ensemble square root Kalman filter: Algorithm AK.

The structure of this paper is as follows. In section 2 we propose a variational approach through minimizing a cost function to find the optimum localization radius in the case that the true covariance is known. In section 3 we generalize the method to the case where the true covariance is unknown but can be represented probabilistically from the sample covariance. Numerical results about the performance of this method are presented through sections 2, 3, and 4. In particular, section 4 contains the numerical results based on the Lorenz-96 system. Section 5 provides a summary and discussion on the proposed localization method.

## 2. Optimal localization with true covariance: A cost function approach

As mentioned earlier, when the physical correlation radius is small the sampling noise at far distances can be largely reduced by localization. Nevertheless, the benefits of this process can occur at the expense of removing small but nonnegligible true covariance beyond the cutoff distance and lowering the true covariance within the localization radius. Figure 1 is a visual illustration of the effect of covariance localization. In this experiment, we use the GC correlation function as the true covariance by setting the true radius of influence (ROI) equal to 5. We draw 61 sample vectors from the Gaussian distribution ^{true}) and compute the sample covariance. Using a localization radius of 24, the localized sampling covariance is then compared with the raw sampling covariance and the true covariance. We can clearly see that the localization method effectively reduces the spurious correlation at far distances.

### Cost function F_{0}

*F*

_{0}. Recall, when one observation

**y**is available, the updating formula of state variable

**x**

_{i}using the algorithm in section 1 is as follows:where

*i*th state variable, Δ

**y**is the innovation,

*ρ*

_{i}is the value of localization function for the

*i*th state variable and observation

**y**.

^{true}we can simply substitute

^{true}for

^{s}without doing localization. Hence, the exact Kalman updating formula becomeswhereis the true regression coefficient (perfect Kalman gain). Here

**h**is the observational operator and

*F*

_{0}. Now let us consider the case when we have more than one observation available at the same time. We can find the optimal ROI for each observation and then find the maximum likelihood estimate of the ROI and choose it to be our overall optimal ROI.

Based on the above analysis, we propose the algorithm A0 (Table 2) (while its results are shown in the left part of Table 3) for evaluating the optimal ROI. In algorithm A0, *j*th observation and *i*th grid point. Similarly *j*th observation and *i*th grid point; *d*_{ji} is the distance (in number of grid points) between the *j*th observation and *i*th grid point. In the left half of Table 3 and the left panel of Fig. 2, we show the value of ROI^{(0)} in terms of grid points for different true covariances and ensemble sizes. Given true covariance of GC form with correlation radius ROI = 2, 5, 10, or 20, we draw samples of size *N* = 11, 21, …, 121 from the distribution ^{true}) and compute the sample covariance ^{s}. We assume observations exist at every grid point and have an error variance of 0.04. Then we compute the ROI^{(0)} using algorithm A0 for each pair of ROI and *N* values. We do this test 1000 times for each pair of parameters and estimate the probability density of ROI^{(0)} and plot the curves of (rescaled) probability density in Fig. 2. In Table 3 (left half) we show the maximum likelihood estimate of ROI^{(0)} for each choice of ensemble size and radius of influence of the true covariance.

Algorithm A0.

Comparison of *F*_{0} and *F* for given true covariance. We do 1000 tests for each experiment and estimate the probability density of ROI^{(1)} for each experiment. We plot the (rescaled) probability density function of ROI^{(1)} in the right panels of Fig. 2 and list the maximum likelihood estimates of ROI^{(1)} in the right half of Table 5. ROI_{max} is the maximal ROI allowed by algorithm A0 and algorithm A1, respectively. At the first glance we see that for some set of parameters the difference between ROI^{(0)} and ROI^{(1)} is large. But we cannot expect that algorithm A0 and algorithm A1 give close values in all situations because in algorithm A1 the information of true covariance is completely unknown anyway. On the other hand, we think the value of ROI^{(1)} in these cases are still acceptable and sometimes it is close to ROI^{(0)}. Moreover we see that when ensemble size is small ROI^{(1)} is also small though the true covariance has larger ROI. This is intuitively right since when ensemble size is smaller we put fewer trust on the ensemble covariance. We also see that for larger ensemble size, both algorithms results in larger ROI.

For all choices of parameters in this experiment, the resulting maximum likelihood estimate of ROI^{(0)} is always larger than the true ROI. This is desirable since our goal is to find a localization that can reduce the spurious correlation at a far distance while also approximating the true correlation at a near distance. Second, we can see that for a larger ensemble size (or ROI) the resulting ROI^{(0)} also gets larger. We also find that as the true ROI increases for a fixed ensemble size, the increment of ROI^{(0)} is almost linearly proportional to the increment of ROI. That is likely due to the Gaussian assumption. More discussions on Table 3 and Fig. 2 will be presented in section 3.

## 3. A probabilistic method of localization

In the above formula *p*(^{s}) is a constant for a given sample and *p*(^{s} | *p*(

*F*over all true (random) covariance is

^{s}is not an open subset of the space of all symmetric positive definite (SPD) matrices. In this case, the space of singular semipositive definite matrix is a surface of the full space of SPD matrix. And the probability distribution lies on this surface. But the true covariance is usually of full rank, hence, it does not lie in this subspace. Though people have studied the analog of the Wishart distribution in the case that

^{s}is of deficient rank, we do not use it for the following two reasons:

- the true covariance may not lie in this subspace and the subspace is not flat;
- mathematically it is harder to simplify the cost function to a computable form.

_{Loc}) is the sample (or true) covariance of the state variables that are within distance ROI from the observation. ROI can only be chosen among those that the amount of state variables within the distance ROI from an observation is strictly less than the ensemble size

*N*. Intuitively this truncated average only takes into account the influence of the observation on the state variables that lie in the “radius of influence .” This restriction also causes the resulting ROI to be extremely small if we directly apply this method to 3D GCMs. This aspect will be studied further in the future.

### a. How to choose the cost function

*F*

_{0}for the truncated average as in section 2, where the true covariance is known, we find that the variational method often leads to a trivial solution (ROI = 0, not shown). We believe part of the reason is that the truncation ignores all information outside of the radius of influence. In other words, the localization radius resulted by using the function

*F*

_{0}in Eqs. (5) and (6) might be much different if we instead use a truncated version of

*F*

_{0}. To limit the loss of information, we consider a modified cost function:

The advantages of the function *F*_{11} are the following:

- The truncated
*F*_{11}and the full*F*_{11}have the same value: - In the case the true covariance is known,
*F*_{0}and*F*_{11}only differ by a constant. More precisely speaking,where this constant*C*has no impact on where the cost function’s minimum is taken in this case. Therefore, the constant*C*has no impact on determining the localization radius.

*F*

_{11}for

*F*

_{0}in algorithm A0 is mathematically equivalent. More importantly, in the case that the true covariance is unknown, the truncated average when use Jeffreys prior can be simplified explicitly so the integration of thousands of variables can be avoided. However, because of the use of the uninformative Jeffreys prior as the true covariance probability distribution function, we find that an additional penalty is necessary to be included in the cost function to avoid a trivial solution where the resulting ROI is simply the largest possible radius of influence allowed by the algorithm. Therefore, we definewhere the second term is a penalty term that can be interpreted as the sampling noise or stability term that accounts for the difference between the true covariance and sampling covariance scaled by the localization coefficient.

*ρ*

_{i}= GC(ROI,

*d*

_{i}).

For the exact mathematical expression/definition of *d*_{Loc}, please refer to the appendix. Note that this integral is taken over thousands of variables so it cannot be directly computed numerically. Mathematical simplifications to Eq. (11) can be made under certain assumptions. A computable form of this expression is derived in the appendix.

#### Theorem 3.1

Suppose there is a scalar observation available at this time and the observational variance is *.* The observational operator **h** = (0, 0, …, 0, 1, 0, …, 0). Let *k* be the state variable that the observation is taken for*.* Let *n*_{Loc} be the number of state variables that are within distance ROI from the observation. Let *ϵ*_{i} ∈ ℝ^{1×N} be the forecast perturbation of the *i*th component of the state variable*.*

*f*(

*x*) when

*x*follows the distribution

Based on the above theorem we propose algorithm A1 (Table 4) to adaptively determine the localization radius. The computational cost of this algorithm is *O*(*N*^{2}*m*).

Algorithm A1.

### b. Comparison of *ROI*^{(0)} and *ROI*^{(1)} for a given true covariance

We present the experimental results in Table 3 and Fig. 2 for a comparison of ROI using the algorithm A0 presented in section 2 [ROI^{(0)}] versus the new probabilistic algorithm A1 [ROI^{(1)}]. The experiment for ROI^{(0)} is still the same as that presented in section 2, where the true covariance is known. In addition, we add columns 7–11, which are values of ROI^{(1)} computed by algorithm A1 for the same true covariance and sample covariance that are used in columns 2–6, though the information of true covariance is completely unknown in algorithm A1. Again we do 1000 tests for each experiment and estimate the probability density of ROI^{(1)} for each experiment. We plot the (rescaled) probability density function of ROI^{(1)} on the right panels of Fig. 2 and list the maximum likelihood estimates (MLE) of ROI^{(1)} in the right panel of Table 3. Note that the last row of Table 3 is the maximum value of ROI^{(0)} and ROI^{(1)} allowed by algorithm A0 and A1 respectively. The upper bound for ROI^{(0)} can be modified arbitrarily with the larger upper bound translating into a larger range of possible solutions. ROI_{max} = 130 is found to be a large enough searching range for our application. The upper bound for ROI^{(1)} is not as flexible as in the ROI^{(0)} case. In this experiment and the experiments in the next section, the maximum of ROI^{(1)} is chosen to be ^{(1)} tends to be large, the ROI^{(1)} of the 1000 independent tests would simply concentrate at the maximum value allowed by the algorithm. This is why we see some peculiar shape of the black curves in the right panels of Fig. 2 when the ensemble size is small.

When the true ROI = 2, ROI^{(0)} varies from 2.5 to 5. For the algorithm that uses no knowledge of the true covariance, ROI^{(1)} changes from 2.2 to 17, which has a much larger range than ROI^{(0)}. For example, when *N* = 11, the MLE of ROI^{(0)} is about 1.3 times the true ROI. When *N* = 121, the MLE of ROI^{(0)} is about 2.6 times the true ROI. Note that now there is no similar linear relationship between the MLE and ROI for ROI^{(1)} [which is found for ROI^{(0)}]. When *N* = 11, ROI = 10 and 20, the probability density of ROI^{(1)} all concentrate at a single point, which causes the graph of the (rescaled) probability density function to not be seen in Fig. 2. Similarly for *N* = 21 and 31, the probability density of ROI^{(1)} in Fig. 2 is very concentrated near the maximal possible ROI (i.e., ROI_{max}). This is because algorithm A1 does not allow ROI^{(1)} to be larger than _{max}. When *N* = 121, the distribution of ROI^{(1)} is far away from the upper bound ROI_{max} but the MLE of ROI^{(1)} does not depend linearly on the true ROI neither.

At first glance, the differences between ROI^{(0)} and ROI^{(1)} are quite large over the set of parameters that we examine. When *N* = 121, the MLE of ROI^{(1)} is larger than that of ROI^{(0)} when the true ROI= 2, 5, and 10 but smaller than that of ROI^{(0)} when the true ROI = 20. This result is not surprising because in algorithm A1 the true covariance is completely unknown as in real-data applications. On the other hand, the values of ROI^{(1)} are close to ROI^{(0)} when *N* = 61. Moreover we see that when the ensemble size is small ROI^{(1)} is also small though the true covariance has larger ROI. We also see that for a larger ensemble size, both algorithms result in larger ROIs. Therefore, the algorithm adjusts the ROI according to the expected sampling error for a given data assimilation cycle.

Although *F*_{11} would give the same value if we slightly enlarge the region allowed by the Wishart distribution, the integral would be different due to the use of the noninformative Jeffreys prior. As a consequence, if we only use *F*_{11} as our cost function the resulting ROI^{(1)} would just be ROI_{max} in most situations. This is why the second term in *F*_{10} is required for making the solution nontrivial. On the other hand, the choice of prior does not necessarily have to be the Jeffreys prior. The penalty term can have other choices as well. In this sense, the method is still quite empirical when the true covariance is unknown. Future studies will explore the use of more informative prior probability distribution functions for optimizing the ensemble localization distance.

## 4. Application to the ensemble covariance generated by the Lorenz-96 model

### a. Experiments with the Lorenz-96 system and approximated true covariance

Now we design experiments to compare algorithms A0 and A1 in a Lorenz-96 system (Lorenz 2006). Here the system is configured to have 120 variables and 30 uniformly distributed observations, which lie on the model grid points. The external forcing *F* is set to 8, the time step *dt* is set to 0.05, and observations appear every two time steps. We use an error variance of 0.04 for all observations in these experiments. Observational operator **H** is simply the restriction operator as required by the current version of the algorithm. To get an approximated true covariance in the Lorenz-96 system, we first run 6000 ensemble members for *T*_{1} time steps with the EnKF data assimilation. We then run the ensemble members for *T*_{2} time steps without data assimilation. Then we use the whole 6000 ensemble members to get an approximation of the true covariance ^{true}. In the first *T*_{1} time steps, we use covariance relaxation with *α* = 0.5 (Zhang et al. 2004). Then, we draw *N* ensemble members from the 6000 members and compute the sample covariance ^{s}. Finally we compute the optimal ROI^{(0)} using algorithm A0 with the known (approximated) true covariance, and compare it with the optimal ROI^{(1)} that was computed by algorithm A1 with no knowledge of the true covariance. We choose *α* = 0.5 here because the model still has a chance to blow up when no localization is used, even if when the ensemble size is 6000. We did 1000 tests for each set of the parameters and we estimated the probability density of the optimum ROI with respect to both algorithm A0 and A1. The graphs of probability density function are shown in Figs. 3 and 4 and the maximum likelihood ROI^{(0)} and ROI^{(1)} are listed in Table 5.

Comparison of *F*_{0} and *F* for approximated, true covariance in Lorenz-96 system. These are maximum likelihood estimates of ROI^{(0)} and ROI^{(1)}. The probability estimates are based on 1000 independent tests. Our true covariance ^{true} is approximated by the sample covariance of 6000 ensemble members. First we run these 6000 members for *T*_{1} time steps with data assimilation. Then we run these members for *T*_{2} time steps without data assimilation. Then we compute the sample covariance and set it to be the true covariance ^{true}. For each test our sample covariance ^{s} is computed from *N* randomly drawn ensemble members. Then we compute ROI^{(0)} and ROI^{(1)} based on that specific sample covariance and true covariance. The left half are maximum likelihood estimates of ROI^{(0)}. The right half are maximum likelihood estimates of ROI^{(1)}. ROI_{max} is the maximal ROI allowed by algorithm A0 and algorithm A1.

We do not expect that algorithms A1 and A0 give similar values in all cases, because in algorithm A1 we do not use the true covariance while in algorithm A0 we have the complete information of true covariance.

When *T*_{1} = 50 (*T*_{1} is the EnKF analysis duration), *N* = 61, the value of ROI^{(0)} = 17.7, 18.6, and 10.6 for different *T*_{2} (*T*_{2} is the length of free ensemble forecast without assimilation). We see ROI^{(1)} = 19.4, 19.5, and 11 for different *T*_{2}, respectively, which are very close to the values of ROI^{(0)}. But for *T*_{1} = 500 (Fig. 4), we see the probability density function of ROI^{(0)} is quite peculiar for *T*_{2} = 1 and 10 (blue and red curves in the left panels) when the ensemble size is small. The maximum likelihood of ROI^{(0)} are around 7.7, 2.5, and 8 for different *T*_{2} no matter if *N* = 11, 21, or 31. Although it is not clear why *T*_{2} = 10 results in ROI^{(0)} ≈ 2.5, which might be due to the system dynamics, this shows that ROI^{(0)} only changes a little when ensemble size changes among *N* = 11, 21, and 31. The estimated probability densities of ROI^{(0)} for these parameters have similar shape as can be seen in first three panels on the left of Fig. 4. Now, if we compare the graphs on all of the left panels of Figs. 2, 3, and 4, we see that although the (approximate) true covariance is known in all of these cases, there is a difference in the shapes of the probability distributions for ROI^{(0)}. It can be clearly seen that in the experiments for the left panels of Fig. 2 the probability density function curves are approximately symmetric about their axis of symmetry, in which case the sample is exactly drawn from a Gaussian distribution. As a comparison, we see that for tests using small ensemble sizes, which is for the left panels of Figs. 3 and 4, the curves do not have symmetry at all. This is probably because the distributions of state variables in the Lorenz-96 system are not close to Gaussian. For *N* = 21, *T*_{1} = 50, *T*_{2} = 10 (red curve in Fig. 3, *N* = 21), the curve even has two peaks. In this case it would be hard for us to determine a good localization radius even if we know the true covariance. For *T*_{1} = 500, *T*_{2} = 100 (the black curves in Fig. 4), the behavior of ROI^{(0)} becomes close to that when *T*_{1} = 50, *T*_{2} = 100 for all ensemble sizes. This might be because after running the model for 100 time steps without data assimilation, the distribution of state variables become close to the “climatological distribution.” But as we compare the results in this case with the numerical results in section 4b, we see that ROI around 11 does not give optimal RMSE for the long time real implementation. However, the curves on the right panels of Figs. 3 and 4 are more symmetric. This might because our probabilistic method still assumes Gaussian distribution and we do see in the cases *T*_{1} = 50 or 500, *T*_{2} = 1 or 10, ROI^{(1)} would result in smaller RMSE according to numerical results in section 4b, hence, it is a better value for real-data implementation.

### b. Experiments that use algorithm A1 to do adaptive localization in Lorenz-96 system

Now we implement algorithm A1 in the Lorenz-96 system. As mentioned before, the Lorenz-96 system configured here has *n* = 120 state variables and is integrated using the Runge–Kutta fourth-order scheme. The observations are uniformly distributed and lie on some of the grid points. The number of observations *m* and ensemble size *N* are different for each case. To handle filter divergence, we use a relaxation coefficient *α* = 0.5 (Zhang et al. 2004). The inflation method is necessary here to prevent the model from blowing up. The choice of *α* = 0.5 is not tuned. We also did the same experiment for *α* = 0.25 and the results are similar. The choice of *α* can slightly influence the resulting ROI^{(1)}, but the change is ignorable at least for this experiment, so we only show the results for *α* = 0.5. At every time step when observations are available, we first compute ROI^{(1)} using algorithm A1, then we use ROI^{(1)} to be the localization radius before assimilating the observations.

#### 1) Sensitivity to ensemble size

In Fig. 5 we plot the curves of ROI^{(1)} as a function of time for different ensemble sizes. In general we see that the resulting localization radius increases with the ensemble size, which is consistent with the results in section 2 where the true covariance is known.

#### 2) Comparison of RMSE for different observational density

In this subsection, we fix the ensemble size at *N* = 61 and plot the spatial RMSE of the ensemble mean as a function of time for times steps between 1 and 50, 1000 and 1200, and 1200 and 1400. These values are plotted for *m* = 30, 60, and 120, in Figs. 6, 7, and 8, respectively, along with ROI^{(1)} as a function of time. While the larger ROI tends to result in smaller RMSEs, the solution can be unstable when the observations are sparse as indicated by the missing red curve in Fig. 6. The optimal choice of ROI must lower the RMSE, while maintaining a stable solution. In Table 6 we show the temporal mean of RMSE from time step 1000 to time step 5000 for different number of observations and fixed/adaptive ROI. When *m* = 30, the temporal mean of RMSE for the green, blue, cyan, and black curve is 0.213, 0.2274, 0.4295, and 0.2296, respectively. We find that the ROI^{(1)} value becomes stable after a few steps of data assimilation and lies around 20.2, and the RMSEs of the black, blue, and green curves are indistinguishable. In this case the adaptive method almost gives the best ROI. On the other hand, we found that the resulting ROI is smaller when the observation density increases. This is likely because assimilating denser observations may cause the ensemble correlation to be more narrowly supported, which then leads to a smaller localization radius by this algorithm. Our speculation is based on Zhang et al. (2006, 729–730) and Daley and Menard (1993, see their Fig. 2), which is often called the “whitening” of the analysis error spectrum in that larger-scale error will be more effectively reduced first with more observations assimilated. In the meantime it is known that with denser observations larger localization radius can be applied to improve the mean update while also maintaining the stability of the solution. When *m* = 60, the temporal mean of the red, green, blue, cyan, and black curves is 0.1038, 0.1061, 0.1138, 0.1381, and 0.1112. The temporal mean of ROI^{(1)} in this case is 18.2209. When *m* = 120, the temporal mean of the red, green, blue, cyan, and black curves is 0.0652, 0.0689, 0.0718, 0.0854, and 0.0713 while the temporal mean of ROI^{(1)} is 16.5928. In Fig. 8, we see that the blue and black curves are nearly identical, because the temporal mean ROI^{(1)} is approximately 16. And from the graph we can easily see that the red curve is consistently better than the black curve implying that there is still room to reduce the RMSE by enlarging the ROI. This suggests that more work needs to be done to take observation density into account so the localization radius determined by this algorithm can be more efficient. On the other hand, comparing the graphs in the top panels we see that with more observations the RMSE reduces more quickly and it is hard to distinguish the curves for the first 50 time steps in Figs. 7 and 8.

The temporal RMSE of the ensemble mean from time steps 1000 to 5000 in the Lorenz-96 system for different observation densities and fixed/adaptive ROI. NaN refers to missing data caused by frequent model breakdown.

#### 3) Curve of *F* value

The value of the cost function *F* for each observation in the *m* = 30 cases is plotted in the top of Fig. 9 as a function of ROI for time steps 2, 120, and 1020. The ensemble size *N* = 61. The red dots indicate where *F* is a minimum for each observation. In the bottom panels, the curves are the (rescaled) estimated probability density of all possible ROIs. The blue dot shows where the maximum likelihood is taken, hence it is also the final ROI^{(1)} we use for localization at each time step. The curves in the top panels are almost convex, making finding the minimum value (red dots) of *F* for each observation not a trivial problem. This is also the consequence of adding the penalty term to the cost function. The bottom panels show that the optimal ROI^{(1)} for each observation is distributed in a way that the probability density function of ROI^{(1)} only has one peak (i.e., the maximum likelihood estimate is clearly well defined). If the density function has two peaks, then it is not easy to determine a unique localization radius. In that case, one can try dividing observations into groups so for each group of observations the probability density function of ROI^{(1)} is better behaved. This aspect of the algorithm needs further exploration in the future.

## 5. Discussion and summary

In this article we first presented a cost function approach to analyze the sampling error issue in the case where the true covariance is known. Then we generalized this method to the case where the true covariance is not known. We presented a probabilistic approach to determine the localization radius adaptively when serial ensemble square root filters are used to assimilate uncorrelated observations. The advantage of this method is that it uses the information from the ensemble members and observations only within a single assimilation window. Further the computational cost of this algorithm is small and its performance in the Lorenz-96 system is promising. We compared the results from this probabilistic method with that from the deterministic method in the case that the true covariance is known. The results show that the probabilistic method gives more useful radius of influence for long time implementation than the deterministic method. As a side result, we find that in order to determine a good localization radius, one needs to consider more than simply sampling error. In particular, the dynamical property of the model needs to be taken into account.

There are severe issues worth further discussion and investigation. First of all, the algorithm under discussion does not utilize any information regarding the observation density. As a consequence, the ROI^{(1)} is not the optimal radius of influence when observations are dense. Second the algorithm in this article is based on serial ensemble square root Kalman filters and has the requirement that the observational operator *H* must be the restriction operator and that observations must be uncorrelated, which must be eliminated in the future work. Third the output of ROI^{(0)} in the experiments in Lorenz-96 system, where the true covariance is approximated by 6000 ensemble members, is clearly not optimal according to the curves in Fig. 8, suggesting again that in order to get a good localization radius other system dynamics besides sampling error need to be considered. On the other hand the choice of a noninformative prior and the use of an empirical penalty term are not necessarily to be optimal as presented in this article. Deriving a useful mathematical formula for other choices of prior is a challenging problem.

There are several ways to make modifications in the algorithm in order to make use of more statistical tools. For example, we can generalize the concept of maximum likelihood as the “peaks of the density function.” More precisely speaking, if the probability density function of ROI^{(1)} has more than one peak, it may be more wise to use the different peak values as the ROI for different observations rather than using only the peak value at the maximum likelihood estimate. One can also try incorporating this scheme with other methods. For example, one can consider taking observational impact into account to get a weighted maximum likelihood estimate. Future research is also warranted to extend the current one-dimensional study to 2D or 3D as well as for covariance across different physical variable (e.g., between temperature and winds).

We thank Zhibiao Zhao for his valuable suggestion on the choice of Jeffreys prior; Ningtao Wang and Cheng You for their valuable suggestion about Wishart distribution; Jeffrey Anderson for his encouraging us to publish this result; John Harlim, Jonathan Poterjoy, and the anonymous reviewers for their patient and careful reading and generous suggestion on how to make the math more clear. This work was supported in part by Office of Naval Research Grant N000140910526 and the National Science Foundation Grant ATM-0840651.

# APPENDIX

## Mathematical Derivation

### a. Notation

For convenience the notation in this appendix is a little different from those in the main context. For example, ^{s} in the appendix is not sample covariance, but _{Loc}; the true regression coefficient is not denoted by *r*_{i}; and *n* is actually short for *n*_{Loc}.

### b. Review of setup

We now consider how to determine the localization radius when assimilating a single observation. Since we only assimilate one observation each time, for the convenience of notation, we always assume the observation is on the first grid point. The observation operator **H** is a row vector in this case and more specifically **H** = (1, 0, …, …, 0).

Let *i*th variable when using the pure sample covariance matrix. Let *r*_{i} be the regression coefficient for the *i*th variable when using the true covariance matrix. Then the regression coefficient for the *i*th variable when using localized sample covariance matrix is *ρ*_{i} = GC(ROI, *d*_{i}).

^{s}be

*N*− 1 times the sample covariance matrix. Let

*p*(

^{s}) be the probability density of the true covariance matrix given

^{s}. ThenLet

**H**= (1, 0, …, 0).

Let *n* = *n*_{Loc} be the number of state variables within distance ROI to the observation and *N* be the ensemble size. We require *N* > *n*_{Loc}.

*F*

_{1}and

*F*

_{2}:andwhereis a constant depending only on the sample and

*n*such that

### c. Mathematical derivation

#### 1) First we compute :

^{s}= (

^{s})

^{−1}and

^{−1}

Hence,

lemma 1:

#### 2) Compute

_{n−1}is the (

*n*− 1) × (

*n*− 1) identity matrix, 0 is the zero row vector, and

**t**= (

*t*

_{2}, …,

*t*

_{n})

^{T}is a column vector whose entries are defined by

*t*

_{i}=−(

*b*

_{1i}/

*b*

_{ii}) for 2 ≤

*i*≤

*n*. We define

*d*

*dt*

_{1}, …,

*dt*

_{n}and

*d*

**t**=

*dt*

_{2}, …,

*dt*

_{n}. Let

*n*− 1) × (

*n*− 1) matrix whose entries are defined by

*b*

_{i+1,j+1}−

*b*

_{1,i+1}

*b*

_{1,j+1}/

*b*

_{11}. Then it can be straightforwardly verified thatand we have

lemma 2:

Because of space limitations we merely remark that this can be proven by careful but straightforward computation, leaving to the interested reader the task of filling in the details.

It is known that for *n* × *n* SPD matrix *d*^{−1} = det(^{−(n+1)}*d*

Therefore

**δ**_{i}=

**ϵ**_{i}+

*t*

_{i}

**ϵ**_{1}. Then by direct computation one can get

*i*and

*j*≤

*n*.

*d*

*dt*

_{1}

*d*

**t**.

*l*>

*n*+ 2 and even integer

*n*,

Lemma 4: For 2*l* > *n* + 2,

Lemma 5: Let ^{n×n} denote the identity matrix and **s** ∈ ℝ^{n} be a column vector. Then det(**ss**^{T}) = 1 + |**s**|^{2}.

^{T})

_{i}is the

*i*th row of the matrix

^{T}.

*a*

_{3}is essentially a Gamma function. One can easily get the desired expression of

*a*

_{3}by substitution.

Hence (combining all the above identities), we have

#### 3) Compute :

*i*≠

*j*

*n*, 2

*k*:And similarly, we can derive

Combining all the results above, we have the following theorem:

_{ii}is the

*i*th diagonal element of

**Δ**.

And by doing the same substitution as in the computation of *a*_{3} in the proof of lemma 7, it is not hard to see the equivalence between Eqs. (A5), (A6) and Eqs. (4), (5).

## REFERENCES

Anderson, J. L., 2001: An ensemble adjustment Kalman filter for data assimilation.

,*Mon. Wea. Rev.***129**, 2884–2903, doi:10.1175/1520-0493(2001)129<2884:AEAKFF>2.0.CO;2.Anderson, J. L., 2012: Localization and sampling error correction in ensemble Kalman filter data assimilation.

,*Mon. Wea. Rev.***140**, 2359–2371, doi:10.1175/MWR-D-11-00013.1.Anderson, J. L., , and L. Lei, 2013: Empirical localization of obervation impact in ensemble Kalman filters.

,*Mon. Wea. Rev.***141**, 4140–4153, doi:10.1175/MWR-D-12-00330.1.Bernardo, J. M., , and A. F. Smith, 1994:

*Bayesian Theory.*John Wiley and Sons, 608 pp.Bishop, C. H., , and D. Hodyss, 2007: Flow adaptive moderation of spurious ensemble correlations and its use in ensemble-based data assimilation.

,*Quart. J. Roy. Meteor. Soc.***133**, 2029–2044, doi:10.1002/qj.169.Bishop, C. H., , B. J. Etherton, , and S. J. Majumdar, 2001: Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects.

,*Mon. Wea. Rev.***129**, 420–436, doi:10.1175/1520-0493(2001)129<0420:ASWTET>2.0.CO;2.Bishop, C. H., , D. Hodyss, , P. Steinle, , H. Sims, , A. M. Clayton, , A. C. Lorenc, , D. M. Barker, , and M. Buehner, 2011: Efficient ensemble covariance localization in variational data assimilation.

,*Mon. Wea. Rev.***139**, 573–580, doi:10.1175/2010MWR3405.1.Daley, R., , and R. Menard, 1993: Spectral characteristics of Kalman filter systems for atmospheric data assimilation.

,*Mon. Wea. Rev.***121**, 1554–1565, doi:10.1175/1520-0493(1993)121<1554:SCOKFS>2.0.CO;2.Evensen, G., 1994: Sequential data assimilation with nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics.

,*J. Geophys. Res.***99**, 10 143–10 162, doi:10.1029/94JC00572.Gaspari, G., , and S. E. Cohn, 1999: Construction of correlation functions in two and three dimensions.

,*Quart. J. Roy. Meteor. Soc.***125**, 723–757, doi:10.1002/qj.49712555417.Hamill, T. M., , and C. Snyder, 2000: A hybrid ensemble Kalman filter–3D variational analysis scheme.

,*Mon. Wea. Rev.***128**, 2905–2919, doi:10.1175/1520-0493(2000)128<2905:AHEKFV>2.0.CO;2.Hamill, T. M., , J. S. Whitaker, , and C. Snyder, 2001: Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter.

,*Mon. Wea. Rev.***129**, 2776–2790, doi:10.1175/1520-0493(2001)129<2776:DDFOBE>2.0.CO;2.Houtekamer, P. L., , and H. L. Mitchell, 2001: A sequential ensemble Kalman filter for atmospheric data assimilation.

,*Mon. Wea. Rev.***129**, 123–137, doi:10.1175/1520-0493(2001)129<0123:ASEKFF>2.0.CO;2.Jun, M., , I. Szunyogh, , M. G. Genton, , F. Zhang, , and C. H. Bishop, 2011: A statistical investigation of sensitivity of ensemble-based Kalman filters to covariance filtering.

,*Mon. Wea. Rev.***139**, 3036–3051, doi:10.1175/2011MWR3577.1.Kalman, R. E., 1960: A new approach to linear filtering and prediction problems.

*Trans. ASME J. Fluids Eng.,***82D,**35–45, doi:10.1115/1.3662552.Lei, L., , and J. Anderson, 2014: Comparisons of empirical localization techniques for serial ensemble Kalman filters in a simple atmospheric general circulation model.

,*Mon. Wea. Rev.***142**, 739–754, doi:10.1175/MWR-D-13-00152.1.Lorenc, A., 2003: The potential of the ensemble Kalman filter for NWP—A comparison with 4D-Var.

,*Quart. J. Roy. Meteor. Soc.***129**, 3183–3203, doi:10.1256/qj.02.132.Lorenz, E. N., 2006: Predictability—A problem partly solved.

*Predictability of Weather and Climate,*T. N. Palmer and R. Hagedorn, Eds., Cambridge University Press, 40–58.Muirhead, R. J., 1982:

*Aspects of Multivariate Statistical Theory.*John Wiley and Sons, 673 pp.Ott, E., and Coauthors, 2004: A local ensemble Kalman filter for atmospheric data assimilation.

*Tellus,***56A**, 415–428, doi:10.1111/j.1600-0870.2004.00076.x.Snyder, C., , and F. Zhang, 2003: Assimilation of simulated Doppler radar observations with an ensemble Kalman filter.

,*Mon. Wea. Rev.***131**, 1663–1677, doi:10.1175//2555.1.Tippett, M. K., , J. L. Anderson, , C. H. Bishop, , T. M. Hamill, , and J. S. Whitaker, 2003: Ensemble square root filters.

,*Mon. Wea. Rev.***131**, 1485–1490, doi:10.1175/1520-0493(2003)131<1485:ESRF>2.0.CO;2.Wang, X., , D. M. Barker, , C. Snyder, , and T. M. Hamill, 2008a: A hybrid ETKF–3DVar data assimilation scheme for the WRF model. Part I: Observing system simulation experiment.

,*Mon. Wea. Rev.***136**, 5116–5131, doi:10.1175/2008MWR2444.1.Wang, X., , D. M. Barker, , C. Snyder, , and T. M. Hamill, 2008b: A hybrid ETKF–3DVar data assimilation scheme for the WRF model. Part II: Real observation experiments.

,*Mon. Wea. Rev.***136**, 5132–5147, doi:10.1175/2008MWR2445.1.Whitaker, J. S., , and T. M. Hamill, 2002: Ensemble data assimilation without perturbed observations.

,*Mon. Wea. Rev.***130**, 1913–1924, doi:10.1175/1520-0493(2002)130<1913:EDAWPO>2.0.CO;2.Zhang, F., , C. Snyder, , and J. Sun, 2004: Impacts of initial estimate and observation availability on convective-scale data assimilation with an ensemble Kalman filter.

,*Mon. Wea. Rev.***132**, 1238–1253, doi:10.1175/1520-0493(2004)132<1238:IOIEAO>2.0.CO;2.Zhang, F., , Z. Meng, , and A. Aksoy, 2006: Tests of an ensemble Kalman filter for mesoscale and regional-scale data assimilation. Part I: Perfect model experiments.

,*Mon. Wea. Rev.***134**, 722–736, doi:10.1175/MWR3101.1.Zhang, F., , M. Zhang, , and J. A. Hansen, 2009: Coupling ensemble Kalman filter with four-dimensional variational data assimilation.

,*Adv. Atmos. Sci.***26**, 1–8, doi:10.1007/s00376-009-0001-8.Zhang, F., , M. Zhang, , and J. Poterjoy, 2013: E3DVar: Coupling an ensemble Kalman filter with three-dimensional variational data assimilation in a limited-area weather prediction model and comparison to E4DVar.

,*Mon. Wea. Rev.***141**, 900–917, doi:10.1175/MWR-D-12-00075.1.