## Abstract

The ensemble square root Kalman filter (ESRF) is a variant of the ensemble Kalman filter used with deterministic observations that includes a matrix square root to account for the uncertainty of the unperturbed ensemble observations. Because of the difficulties in solving this equation, a serial approach is often used where observations are assimilated sequentially one after another. As previously demonstrated, in implementations to date the serial approach for the ESRF is suboptimal when used in conjunction with covariance localization, as the Schur product used in the localization does not commute with assimilation. In this work, a new algorithm is presented for the direct solution of the ESRF equations based on finding the eigenvalues and eigenvectors of a sparse, square, and symmetric positive semidefinite matrix with dimensions of the number of observations to be assimilated. This is amenable to direct computation using dedicated, massively parallel, and mature libraries. These libraries make it relatively simple to assemble and compute the observation principal components and to solve the ESRF without using the serial approach. They also provide the eigenspectrum of the forward observation covariance matrix. The parallel direct approach described in this paper neglects the near-zero eigenvalues, which regularizes the ESRF problem. Numerical results show this approach is a highly scalable parallel method.

## 1. Introduction

Modern data assimilation methods for numerical weather prediction are called upon to merge enormous numbers of observations of the surface and atmosphere (Lahoz and Schneider 2014) with model forecasts consisting of even larger numbers of state variables to improve initial conditions and to produce probability estimates. As the size of this problem continues to increase, especially as remotely sensed satellite data become increasingly prevalent, it is essential to develop scalable methods that can handle growing problem sizes in efficient and accurate ways.

In this paper we will explore a new method for directly solving the square root ensemble Kalman filter (EnKF) equations without using a serial approximation or perturbed observations. This implementation is scalable across many processing elements and takes advantage of recent improvements in both theoretical and computational linear algebra. It avoids a major problem that occurs with the serial approximation to the square root EnKF when used in the presence of covariance localization—namely, that the ordering of the observations becomes important, since the covariance localization operation does not commute (Nerger 2015; Bishop et al. 2015). While Bishop et al. (2015) introduced the consistent hybrid ensemble filter (CHEF), which provides consistent serial observation processing in a non–square root EnKF framework by using perturbed observations, this manuscript addresses how to make the square root EnKF equations consistent with covariance localization without the use of perturbed observations using a parallel direct approach.

The EnKF was originally introduced by Evensen (1994) as a method to estimate the covariances used in the Kalman filter equation (Kalman 1960) using an ensemble of model predictions. This innovation made it feasible to run statistical data assimilation problems even on very large dynamical systems of size $N$, where the full covariance matrices of size $N \times N$ would be prohibitively large to store and invert. Instead, a small ensemble of size $N_e$, where $N_e \ll N$, is used to estimate the necessary covariance matrices.

Two crucial issues arise in the use of the EnKF as originally formulated. The first crucial issue is the systematic underestimation of ensemble spread that occurs when assimilating the same observations to update both the ensemble mean and the covariance with the unmodified Kalman equations. Two main families of solutions have been devised to address this problem. The first approach (Houtekamer and Mitchell 1998; Burgers et al. 1998) is to perturb observations with independently sampled noise to update each ensemble member. However, this introduces sampling error, which causes the filter to be suboptimal—especially when $N_e$ is small—as demonstrated by Whitaker and Hamill (2002). The other approach, which uses the same nonperturbed observation for each ensemble member, is the ensemble square root filter (ESRF). This method explicitly accounts for the underrepresentation of error stemming from the use of deterministic observations by adding a square root term to the Kalman update for the ensemble. Various flavors of ESRF have been developed (Bishop et al. 2001; Anderson 2001; Whitaker and Hamill 2002) that Tippett et al. (2003) showed are all equivalent, in the sense that they perform analysis in the same vector space and find the same covariance.

The second crucial issue is that the sample covariance, based on a low-rank estimate, may contain spurious correlations between two distant points. This issue decreases in severity as the ensemble size grows, but because by design $N_e \ll N$, the issue of spurious correlations cannot be completely eliminated by increasing the ensemble size. Therefore, a covariance localization method, which uses a Schur product (component-wise multiplication) to zero out correlations further than a specified distance (Gaspari and Cohn 1999; Houtekamer and Mitchell 2001; Hamill et al. 2001), is typically employed to reduce spurious correlations. This also increases the rank of the covariance matrices.

Without covariance localization, the ESRF equations can be solved in a serial fashion, as the Kalman gain commutes. The commutativity property of the serial ESRF solution is lost when covariance localization is used, as the Schur product is a nonlinear operation, and the serial ESRF approximation does not correctly update the high-rank localized matrix. Issues with the combination of the serial ESRF and covariance localization first appeared as differences in results in Holland and Wang (2013) using a simplified primitive equation model; subsequently, Nerger (2015) and Bishop et al. (2015) clearly demonstrated the suboptimality of the serial approach when used without perturbed observations, while Bishop et al. (2015) presented the CHEF framework that provides consistent updates to the mean and covariances with perturbed observations. In the presence of covariance localization without perturbed observations, assimilating all observations at once is a theoretically preferable method. However, the challenge of implementing this in a scalable way in actual full-scale data assimilation systems seems daunting at first. In this paper, we demonstrate how to take advantage of modern, massively parallel linear algebra libraries to solve these equations in a scalable way. This becomes the basis for our parallel direct ESRF filter.

### a. Motivation

Several parallel EnKF implementations are described in the literature, including Keppenne and Rienecker (2002), Zhang et al. (2005), Anderson and Collins (2007), and Wang et al. (2013). These methods do not specifically address the issue of the clash between covariance localization and the serial approach that causes the ordering of observations to become important. Furthermore, as shown in Anderson and Collins (2007), an additional complexity that arises when using the serial filter in a parallel setting is that the forward-calculated observations $H(\mathbf{x})$ (where *H* is the possibly nonlinear operator mapping the model space to the observation space and $\mathbf{x}$ is the model/state variable) must be either recomputed (which would require communication among processors) or treated as part of the augmented state that is also updated during the serial filter. This second approach serves to alleviate a major problem with parallelizing the serial filter, but it still requires observations to be assimilated one at a time. Indeed, the very concept of “serial” observations seems antithetical to parallelization, by definition. Therefore, perhaps a reevaluation of the serial filter is in order, as the data assimilation problem size continues to grow.

In Bishop et al. (2015) a method is described that ensures the assimilation update does not depend upon the order of the observations: serial batches of perturbed observations are assimilated while correctly updating the true high-rank localized covariance matrices. In this paper we present a method for solving the global ESRF equations with covariance localization such that the solution is likewise independent of ordering, but in our case the observations are assimilated all at once. Because we work with eigenpairs of the forward-calculated observation covariance matrix, it is natural for us to examine the spectrum of this matrix. We demonstrate that the effective rank of the forward observation covariance matrix decreases as a function of localization length scale. This method can therefore be used to study the eigenspectrum of different localization techniques.

### b. The HWRF Ensemble Data Assimilation System

For our base data assimilation system, we use the Hurricane Ensemble Data Assimilation System (HEDAS), developed by the Hurricane Research Division (HRD) of the NOAA Atlantic Oceanographic and Meteorological Laboratory (AOML) under NOAA’s Hurricane Forecast Improvement Program (HFIP) (Aksoy et al. 2012, 2013; Aksoy 2013; Vukicevic et al. 2013; Aberson et al. 2015). HEDAS is based on a serial implementation of the ESRF and is specifically designed to assimilate hurricane observations collected by reconnaissance aircraft. Its focus is to generate a vortex that reasonably matches flight data, and it includes covariance localization at heuristically determined length scales. The application uses 30 ensemble members. The initial and lateral boundary ensemble perturbations are obtained from operational National Centers for Environmental Prediction Global Ensemble Forecast System analyses. Aksoy et al. (2013) showed that the primary circulation in HEDAS exhibits correlations with their observed counterparts of 87%–97%. HEDAS also produces good analyses in other aspects of the structure of the primary circulation, such as the radius of maximum winds, wavenumber 1 asymmetry, and storm size. As noted by Nerger (2015), the suboptimal effect of the noncommutativity of observations and covariance localization is hypothesized to be small in realistic systems when the assimilation produces small changes to the background, and as HEDAS distributes observations among multiple frequently spaced assimilation cycles (Aksoy 2013), this is likely to be the case, especially after the first assimilation cycle. Nonetheless, in this paper we are interested in describing a new scalable, parallel, optimal ESRF core built upon the HEDAS processing framework that overcomes the limitations described above.

## 2. Direct solution of square root observation filter equations: Theoretical aspects

In this section we develop the theoretical basis for our parallel direct solution to the ESRF equations using covariance localization. The important symbols referenced in this paper are listed in the appendix to assist the reader.

### a. Solution of ESRF matrix function via eigenpairs

For the ensemble mean $\overline{\mathbf{x}}$ of size $N \times 1$ and ensemble perturbations $\mathbf{x}'_j$ of size $N \times 1$, the square root ensemble Kalman filter without perturbed observations (Whitaker and Hamill 2002) can be written as an update to the analysis *a* from the previous forecast *f* as

$$\overline{\mathbf{x}}^a = \overline{\mathbf{x}}^f + \mathbf{K}\left[\mathbf{y}^o - \overline{H(\mathbf{x}^f)}\right], \qquad \mathbf{x}'^a_j = \mathbf{x}'^f_j - \tilde{\mathbf{K}}\,H(\mathbf{x}^f_j)', \tag{1}$$

where $\mathbf{y}^o$ ($N_{\mathrm{obs}} \times 1$) is the observations, $\overline{H(\mathbf{x}^f)}$ ($N_{\mathrm{obs}} \times 1$) is the mean of the forward-calculated observations, and $H_i(\mathbf{x}^f_j)' = H_i(\mathbf{x}^f_j) - \overline{H_i(\mathbf{x}^f)}$ is the mean-subtracted *i*th observation operator acting on the *j*th ensemble member. The observation perturbation term is $\mathbf{0}$ (an $N_{\mathrm{obs}} \times 1$ matrix filled with zeros), as the observations are unperturbed, and the assumption is that $\overline{H(\mathbf{x}^f)} \approx H(\overline{\mathbf{x}}^f)$. The traditional Kalman gain, $\mathbf{K}$ ($N \times N_{\mathrm{obs}}$), is

$$\mathbf{K} = \mathbf{P}^f\mathbf{H}^{\mathrm T}\left(\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T} + \mathbf{R}\right)^{-1}, \tag{2}$$

where $\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is the covariance between $\mathbf{x}^f$ (an $N \times 1$ random variable representing the previous forecast) and $H(\mathbf{x}^f)$ (the observation operator acting on this random variable), $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is the corresponding covariance of $H(\mathbf{x}^f)$ with itself, and $\mathbf{R}$ is the observation error covariance for a random variable representing the true observations without observation noise.

Term $\tilde{\mathbf{K}}$ ($N \times N_{\mathrm{obs}}$), the correction from using nonperturbed observations, is

$$\tilde{\mathbf{K}} = \mathbf{P}^f\mathbf{H}^{\mathrm T}\left[\left(\sqrt{\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}+\mathbf{R}}\right)^{-1}\right]^{\mathrm T}\left(\sqrt{\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}+\mathbf{R}}+\sqrt{\mathbf{R}}\right)^{-1}. \tag{3}$$
These equations produce the minimum error variance estimate of the state and covariance given the observations, a result that holds even with non-Gaussian errors and nonlinear observations (Kalnay 2003).

As it is difficult to compute the square roots and inverses in Eqs. (1)–(3), the so-called serial approach is usually taken, but in this work we aim to directly solve Eq. (1). Since $\tilde{\mathbf{K}}$ involves the square root^{1} of $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}+\mathbf{R}$ and of $\mathbf{R}$, $\tilde{\mathbf{K}}$ is more difficult to work with than $\mathbf{K}$. These square roots will require analysis of eigenvalue/eigenvector pairs.

To begin, it simplifies our analysis greatly if we remove the square root of $\mathbf{R}$ by noting that $\mathbf{R}$ can be transformed into the identity matrix through the transformation

$$\mathbf{y}^o = \mathbf{R}^{-1/2}\,\mathbf{y}^o_{\mathrm{old}}, \tag{4}$$

where the “old” subscript represents the untransformed observation. In addition, the observation operator is also scaled as $H = \mathbf{R}^{-1/2} H_{\mathrm{old}}$. The effect of this transformation is that $\mathbf{R}$ is now the identity. This changes the covariances from the original ESRF problem, but assimilating observations with different units (or no units, as with this transformation) should not lead to drastically different analyses.^{2} Note that for the diagonal observation error matrix typically considered, multiplying by $\mathbf{R}^{-1/2}$ amounts to dividing each observation by the standard deviation of the observation error; for correlated nondiagonal $\mathbf{R}$, this transformation removes that correlation by using principal components.
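As an illustration of the transformation in Eq. (4), the following Python sketch (our own; the function and variable names are illustrative and not from the HEDAS code) whitens a correlated observation error covariance via its inverse symmetric square root:

```python
import numpy as np

def whiten_observations(y_old, H_old, R):
    """Transform observations and a (linear) forward operator so the
    observation error covariance becomes the identity, as in Eq. (4)."""
    # Inverse symmetric square root of R via its eigendecomposition
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return R_inv_sqrt @ y_old, R_inv_sqrt @ H_old, R_inv_sqrt

# Correlated observation errors: the transform de-correlates them
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])
y = np.array([1.0, 3.0])
H = np.eye(2)
y_new, H_new, R_inv_sqrt = whiten_observations(y, H, R)
# In the transformed space the error covariance is the identity:
R_new = R_inv_sqrt @ R @ R_inv_sqrt.T
```

For a diagonal `R`, this reduces to dividing each observation by its error standard deviation, as noted in the text.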

We now begin our analysis. Rearranging Eq. (3), we have

$$\tilde{\mathbf{K}} = \mathbf{P}^f\mathbf{H}^{\mathrm T}\,\mathbf{B} \tag{5}$$

for $\mathbf{B} = \left(\mathbf{I}+\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)^{-1/2}\left[\mathbf{I}+\left(\mathbf{I}+\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)^{1/2}\right]^{-1}$. By Eq. (4), $\mathbf{R} = \mathbf{I}$. Let $(\lambda_i, \mathbf{e}_i)$ denote the *i*th eigenpair of $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$. Then

$$\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\,\mathbf{e}_i = \lambda_i\,\mathbf{e}_i. \tag{6}$$

For some square matrix $\mathbf{A}$, as the eigenvectors for $(\alpha\mathbf{I}+\mathbf{A})^p$ are the same as those of $\mathbf{A}$ for any scalar *α* and *p*, while the eigenvalues are $(\alpha+\lambda_i)^p$ (Higham 2008), we have

$$\left(\mathbf{I}+\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)^{\pm 1/2}\mathbf{e}_i = \left(1+\lambda_i\right)^{\pm 1/2}\mathbf{e}_i; \tag{7}$$

therefore,

$$\mathbf{B}\,\mathbf{e}_i = f(\lambda_i)\,\mathbf{e}_i \tag{8}$$

for

$$f(\lambda_i) = \frac{1}{\sqrt{1+\lambda_i}\left(1+\sqrt{1+\lambda_i}\right)}. \tag{9}$$

Note that $\lambda_i \ge 0$, as $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is a symmetric positive semidefinite covariance matrix. As $\lambda_i \to \infty$, $f(\lambda_i)$ goes to 0, while as $\lambda_i \to 0$, $f(\lambda_i)$ goes to 1/2.
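The eigenpair treatment of the square root correction can be checked numerically. The sketch below is our own illustration: it assumes the scalar function $f(\lambda) = 1/[\sqrt{1+\lambda}\,(1+\sqrt{1+\lambda})]$, which matches the stated limits of 1/2 at $\lambda = 0$ and 0 as $\lambda \to \infty$, and applies the correction matrix through the eigenpairs of a small symmetric positive semidefinite matrix, comparing against the explicit matrix form:

```python
import numpy as np

def f(lam):
    # Scalar spectral function: f(0) = 1/2, f(lam) -> 0 as lam -> inf
    s = np.sqrt(1.0 + lam)
    return 1.0 / (s * (1.0 + s))

def apply_sqrt_correction(G, v):
    """Apply B = (I+G)^{-1/2} [I + (I+G)^{1/2}]^{-1} to v via eigenpairs of G."""
    lam, E = np.linalg.eigh(G)           # G is symmetric positive semidefinite
    return E @ (f(lam) * (E.T @ v))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
G = A @ A.T                              # rank-deficient SPSD test matrix
v = rng.standard_normal(5)
w = apply_sqrt_correction(G, v)

# Explicit matrix form of B for comparison
lam, E = np.linalg.eigh(G)
S = E @ np.diag(np.sqrt(1.0 + lam)) @ E.T            # (I+G)^{1/2}
B = np.linalg.inv(S) @ np.linalg.inv(np.eye(5) + S)  # (I+G)^{-1/2}[I+(I+G)^{1/2}]^{-1}
```

Because $(\mathbf{I}+\mathbf{G})^{1/2}$ and $\mathbf{I}+(\mathbf{I}+\mathbf{G})^{1/2}$ share eigenvectors, the two routes agree to machine precision.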

With covariance localization, without the loss of symmetry or positive semidefiniteness, the matrix in our implementation is formed by a Schur product,

$$\boldsymbol{\rho} \circ \left(\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right), \tag{10}$$

where $\boldsymbol{\rho}$ is the localization matrix arising from a localization function $\rho$ (Gaspari and Cohn 1999) such that

$$\left[\boldsymbol{\rho}\right]_{ij} = \rho\!\left(d_{ij}, L\right), \tag{11}$$

where $d_{ij}$ is the distance between the location of the *i*th and *j*th observations,^{3} and $L$ is the characteristic length scale for the localization function $\rho$. Matrix $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is defined as

$$\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T} = \frac{1}{N_e-1}\sum_{j=1}^{N_e} H(\mathbf{x}^f_j)'\left[H(\mathbf{x}^f_j)'\right]^{\mathrm T}. \tag{12}$$
By the Schur product rank inequality theorem,

$$\mathrm{rank}\left(\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right) \le \mathrm{rank}\left(\boldsymbol{\rho}\right)\,\mathrm{rank}\left(\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right). \tag{13}$$

Since the rank of $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is at most $N_e - 1$, we have

$$\mathrm{rank}\left(\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right) \le \left(N_e - 1\right)\mathrm{rank}\left(\boldsymbol{\rho}\right). \tag{14}$$
Assume for a moment that $d_{ij} > 0$ for all $i \ne j$. The rank of $\boldsymbol{\rho}$ depends on *L* and the distances between observations. When $L \to 0$, all observations become completely independent (as $\rho_{ij} \to \delta_{ij}$ for the Kronecker *δ*) and $\boldsymbol{\rho}$ is full rank, while when $L \to \infty$, all observations are completely correlated (as $\rho_{ij} \to 1$ for all *i* and *j*) and $\mathrm{rank}(\boldsymbol{\rho}) = 1$, leaving the rank the same as without localization.
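The rank-increasing behavior of the Schur product can be demonstrated with a short sketch (our own illustration; `gaspari_cohn` implements the compactly supported fifth-order piecewise rational function of Gaspari and Cohn 1999, Eq. 4.10):

```python
import numpy as np

def gaspari_cohn(z):
    """Gaspari and Cohn (1999, Eq. 4.10) correlation function;
    z = separation / c, compactly supported on [0, 2]."""
    z = np.abs(z)
    out = np.zeros_like(z)
    near = z <= 1.0
    far = (z > 1.0) & (z < 2.0)
    zn = z[near]
    out[near] = (-0.25*zn**5 + 0.5*zn**4 + 0.625*zn**3
                 - (5.0/3.0)*zn**2 + 1.0)
    zf = z[far]
    out[far] = ((1.0/12.0)*zf**5 - 0.5*zf**4 + 0.625*zf**3
                + (5.0/3.0)*zf**2 - 5.0*zf + 4.0 - (2.0/3.0)/zf)
    return out

# Low-rank ensemble estimate of the forward observation covariance
rng = np.random.default_rng(1)
n_obs, n_ens = 40, 5
locs = np.arange(n_obs, dtype=float)          # 1D observation locations
Hx = rng.standard_normal((n_obs, n_ens))
Hx -= Hx.mean(axis=1, keepdims=True)
G = Hx @ Hx.T / (n_ens - 1)                   # rank <= n_ens - 1
d = np.abs(locs[:, None] - locs[None, :])
rho = gaspari_cohn(d / 5.0)                   # short localization length scale
G_loc = rho * G                               # Schur (elementwise) product, Eq. (10)
```

The localized matrix keeps the diagonal (observation variances) intact while raising the rank well above $N_e - 1$, which is the point of Eqs. (13) and (14).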

Returning our attention to Eq. (10), for $\lambda_i \approx 0$, the corresponding value of $f$ will be approximately 1/2 (which makes the matrix function invertible regardless of whether $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is). The remaining eigenvectors corresponding to the approximately zero eigenvalues are orthogonal to $\mathbf{e}_i$ for $i \le r$, where $r$ is the number of retained leading eigenpairs.

With this in mind, suppose we set all eigenvalues of the matrix function to 1/2 for $i > r$ for some $r$ such that $\lambda_{r+1}$ is small. Thus, define the approximation $\hat{\mathbf{B}}$ to the matrix $\mathbf{B}$ from the rearranged Eq. (3), operating on the *i*th eigenvector, as

$$\hat{\mathbf{B}}\,\mathbf{e}_i = \begin{cases} f(\lambda_i)\,\mathbf{e}_i, & i \le r, \\ \tfrac{1}{2}\,\mathbf{e}_i, & i > r. \end{cases} \tag{15}$$
The difference between $\mathbf{B}$ and $\hat{\mathbf{B}}$ is 0 for $i \le r$, and for $i > r$ it is

$$\left(\mathbf{B}-\hat{\mathbf{B}}\right)\mathbf{e}_i = \left[f(\lambda_i) - \frac{1}{2}\right]\mathbf{e}_i; \tag{16}$$

by Eq. (7),

$$f(\lambda_i) - \frac{1}{2} = \frac{2 - \sqrt{1+\lambda_i}\left(1+\sqrt{1+\lambda_i}\right)}{2\sqrt{1+\lambda_i}\left(1+\sqrt{1+\lambda_i}\right)}. \tag{17}$$

The Taylor series expansion around 0 for $\sqrt{1+\lambda_i}$ is $1 + \lambda_i/2 - \lambda_i^2/8 + \cdots$. Substituting,

$$f(\lambda_i) - \frac{1}{2} \approx \frac{2 - \left(1 + \frac{\lambda_i}{2}\right)\left(2 + \frac{\lambda_i}{2}\right)}{2\left(1 + \frac{\lambda_i}{2}\right)\left(2 + \frac{\lambda_i}{2}\right)}. \tag{18}$$

Simplifying,

$$\left|f(\lambda_i) - \frac{1}{2}\right| \le \frac{\frac{3}{2}\lambda_i + \frac{\lambda_i^2}{4}}{4} = \frac{3}{8}\lambda_i + \frac{\lambda_i^2}{16}. \tag{19}$$

This gives an upper bound of the error as

$$\left\|\left(\mathbf{B}-\hat{\mathbf{B}}\right)\mathbf{e}_i\right\| \le \frac{3}{8}\lambda_i + \frac{\lambda_i^2}{16} \tag{20}$$

for $i > r$. Since the two (spectral) norm of some matrix $\mathbf{A}$ is the square root of the largest eigenvalue of $\mathbf{A}^{\mathrm T}\mathbf{A}$, we have

$$\left\|\mathbf{B}-\hat{\mathbf{B}}\right\|_2 \le \frac{3}{8}\lambda_{r+1} + \frac{\lambda_{r+1}^2}{16}, \tag{21}$$

where $r+1$ is the index of the first neglected eigenvalue. Therefore, if we can devise a numerical method to find the $r$ largest-eigenvalue eigenpairs of the sparse matrix $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$, we can accurately and efficiently directly solve the difficult part of the square root ensemble Kalman filter equations. In particular, as the space spanned by the constant-1/2 eigendirections is orthogonal to the space spanned by the first *r* eigenvectors, we have

$$\tilde{\mathbf{K}}\,\mathbf{z}_j = \mathbf{P}^f\mathbf{H}^{\mathrm T}\left[\frac{1}{2}\left(\mathbf{z}_j - \mathbf{z}_j^{(r)}\right) + \sum_{i=1}^{r} f(\lambda_i)\,z_{ji}\,\mathbf{e}_i\right], \tag{22}$$

where $\mathbf{z}_j = -H(\mathbf{x}^f_j)'$ is the *j*th negative observation perturbation; $z_{ji}$ is the projection of $\mathbf{z}_j$ along $\mathbf{e}_i$, that is,

$$z_{ji} = \mathbf{e}_i^{\mathrm T}\,\mathbf{z}_j; \tag{23}$$

and $\mathbf{z}_j^{(r)}$ is the projection of $\mathbf{z}_j$ into the space spanned by the first *r* eigenvectors, that is,

$$\mathbf{z}_j^{(r)} = \sum_{i=1}^{r} z_{ji}\,\mathbf{e}_i. \tag{24}$$

For $\mathbf{K}$, the same method can be used on $\left(\mathbf{I} + \boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)^{-1}$ instead of the square root, that is,

$$\mathbf{K}\,\mathbf{d} = \mathbf{P}^f\mathbf{H}^{\mathrm T}\left[\left(\mathbf{d} - \mathbf{d}^{(r)}\right) + \sum_{i=1}^{r} \frac{1}{1+\lambda_i}\,d_i\,\mathbf{e}_i\right] \tag{25}$$

for the innovation $\mathbf{d} = \mathbf{y}^o - \overline{H(\mathbf{x}^f)}$, with $d_i$ and $\mathbf{d}^{(r)}$ defined analogously to Eqs. (23) and (24).
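The truncated eigenpair expansion for the inverse term can be made concrete with a short sketch (our own, with hypothetical names; shown with $\mathbf{R} = \mathbf{I}$). It applies $(\mathbf{I}+\mathbf{G})^{-1}$ to a vector using only the leading eigenpairs of a symmetric positive semidefinite $\mathbf{G}$ and treats the neglected directions as having eigenvalue zero, for which $(1+0)^{-1} = 1$. When $r$ equals the rank of $\mathbf{G}$, the result is exact:

```python
import numpy as np

def apply_inverse_truncated(G, v, r):
    """Apply (I + G)^{-1} v using the r largest eigenpairs of G;
    the remaining directions are passed through with unit weight."""
    lam, E = np.linalg.eigh(G)
    idx = np.argsort(lam)[::-1][:r]      # indices of the r largest eigenvalues
    Er, lam_r = E[:, idx], lam[idx]
    coeffs = Er.T @ v                    # projections v_i = e_i^T v
    v_r = Er @ coeffs                    # component inside the retained span
    return (v - v_r) + Er @ (coeffs / (1.0 + lam_r))

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
G = A @ A.T                              # SPSD matrix of rank 3
v = rng.standard_normal(6)
exact = np.linalg.solve(np.eye(6) + G, v)
approx = apply_inverse_truncated(G, v, r=3)   # r = rank(G): no truncation error
```

The same pattern applies to the square root correction term, with $f(\lambda_i)$ replacing $1/(1+\lambda_i)$ and the constant 1/2 replacing the unit weight on the complement.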

Equation (1) is now solved by premultiplying the bracketed quantities of Eqs. (22) and (25) by the localized

$$\boldsymbol{\rho}^{xy} \circ \left(\mathbf{P}^f\mathbf{H}^{\mathrm T}\right) \tag{26}$$

for

$$\left[\boldsymbol{\rho}^{xy}\right]_{ij} = \rho\!\left(d^{xy}_{ij}, L\right), \tag{27}$$

where $d^{xy}_{ij}$ is the distance between the location of the model state *i* and observation *j* with the same localization function as Eq. (11), and

$$\mathbf{P}^f\mathbf{H}^{\mathrm T} = \frac{1}{N_e-1}\sum_{j=1}^{N_e} \mathbf{x}'^f_j\left[H(\mathbf{x}^f_j)'\right]^{\mathrm T}. \tag{28}$$
### b. Rank-deficient observation covariance matrix

In the previous section, we developed an algorithm to solve the ESRF equations directly by finding the *r* largest eigenvalues of the localized matrix $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ and proved that the eigenpair-based method solves Eq. (1) to within a configurable tolerance. However, the projection terms $\frac{1}{2}\left(\mathbf{z}_j - \mathbf{z}_j^{(r)}\right)$ and $\left(\mathbf{d} - \mathbf{d}^{(r)}\right)$ are problematic. If the matrix is full rank, then these terms will be zero, since the eigenvectors $\mathbf{e}_i$ for $i \le r$ will form a basis for the observation space. In the case that the matrix is rank deficient, these terms are allowed to contribute to the inverse directly without input from the ensemble covariance. In other words, if the matrix is rank deficient, then there will be a vector $\mathbf{v}$ such that $\left(\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)\mathbf{v} = \mathbf{0}$ (i.e., $\mathbf{v}$ is in the null space of the matrix), and therefore this portion of the innovation can be added into the inverse without being impacted by the observation covariance. This term represents pure noise and degrades the quality of the solution.

To see this more clearly, consider that the eigenvalues approach zero as *i* approaches the rank of the matrix, but there is a fundamental difference between eigenpairs with $\lambda_i > 0$ and those with $\lambda_i = 0$. From Eq. (25), the matrix inverse acts so that the largest eigenvalues become the smallest and vice versa. This means that when $\lambda_i$ is large, the contribution from the ensemble-estimated observation covariance dominates [and therefore the weight $1/(1+\lambda_i)$ approaches 0], while when $\lambda_i$ is small, the contribution from the observation error covariance dominates [so that $1/(1+\lambda_i)$ approaches 1]. Thus, the inverse is a weighting of the relative uncertainties between the observation ensemble and the observation error covariances. When $\lambda_i = 0$, the weighting erroneously asserts that the contribution from the ensemble-estimated observation covariance is zero; that is, the ensemble covariance is *completely certain* of this mapping direction, when in fact, according to the ensemble estimate, this direction is actually a linear combination of the other observations.

The problem comes about because, according to the modeled covariance $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$, the covariance of the forward-calculated observations can be described with only $r$ linearly independent vectors; that is, there is some linear dependence between the observations. To fix this, let us conceptually assimilate only $r$ principal component observations. Consider

$$H^{\mathrm{pc}}(\mathbf{x}) = \mathbf{E}_r^{\mathrm T}\,H(\mathbf{x}), \tag{29}$$

where $\mathbf{E}_r = \left[\mathbf{e}_1, \ldots, \mathbf{e}_r\right]$ is a matrix with columns of the first *r* eigenvectors of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ ($N_{\mathrm{obs}} \times r$, with $\mathbf{E}_r^{\mathrm T}\mathbf{E}_r = \mathbf{I}_r$). Thus,

$$\mathbf{E}_r^{\mathrm T}\left(\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)\mathbf{E}_r = \boldsymbol{\Lambda}_r, \tag{30}$$

where $\boldsymbol{\Lambda}_r$ is a diagonal matrix with the first *r* eigenvalues of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$. Furthermore, for the observations $\mathbf{E}_r^{\mathrm T}\,\mathbf{y}^o$,

$$\mathbf{E}_r^{\mathrm T}\,\mathbf{I}\,\mathbf{E}_r = \mathbf{I}_r, \tag{31}$$

a diagonal matrix, and so likewise is

$$\boldsymbol{\Lambda}_r + \mathbf{I}_r. \tag{32}$$

The inverses of these matrices are now easy to compute as

$$\left(\boldsymbol{\Lambda}_r + \mathbf{I}_r\right)^{-1} = \mathrm{diag}\!\left(\frac{1}{1+\lambda_1}, \ldots, \frac{1}{1+\lambda_r}\right) \tag{33}$$

and

$$\left[\left(\boldsymbol{\Lambda}_r + \mathbf{I}_r\right)^{1/2}\right]^{-1}\left[\left(\boldsymbol{\Lambda}_r + \mathbf{I}_r\right)^{1/2} + \mathbf{I}_r\right]^{-1} = \mathrm{diag}\!\left[f(\lambda_1), \ldots, f(\lambda_r)\right]. \tag{34}$$
Finally, while it is theoretically possible to directly work with the principal component observations, for observation-space covariance localization in Eq. (26), we must project back to the observation space so that each observation has an ascribable location. Therefore, writing $\tilde{\mathbf{K}}$ and $\mathbf{K}$ in terms of the summation notation in Eqs. (22) and (25) above,

$$\tilde{\mathbf{K}}\,\mathbf{z}_j = \mathbf{P}^f\mathbf{H}^{\mathrm T}\sum_{i=1}^{r} f(\lambda_i)\,z_{ji}\,\mathbf{e}_i \tag{35}$$

and

$$\mathbf{K}\,\mathbf{d} = \mathbf{P}^f\mathbf{H}^{\mathrm T}\sum_{i=1}^{r} \frac{1}{1+\lambda_i}\,d_i\,\mathbf{e}_i. \tag{36}$$
Note that the only differences between Eqs. (22) and (35) and (25) and (36) are the projection (noise) terms. Assimilating principal component observations removes these terms, and they would be zero if $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ were full rank. Therefore, assimilating principal component observations is equivalent to neglecting the projection terms in Eqs. (22) and (25), and thus we need not actually directly assimilate principal component observations; the inverse matrix action can be transformed round-trip back into observation space, avoiding the null space of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ in the process. The projection-neglected Eqs. (35) and (36) are thus used with Eq. (26) to compute $\tilde{\mathbf{K}}$ and $\mathbf{K}$ in Eq. (1) to solve the ESRF problem.
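The equivalence between neglecting the projection terms and assimilating principal component observations can be illustrated numerically. In this sketch (our own, with hypothetical names), `with_noise` retains the null-space projection term, `corrected` drops it, and `via_pc` performs the round trip through principal component observations; the last two coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2))
G = A @ A.T                               # rank-deficient observation covariance
lam, E = np.linalg.eigh(G)
idx = np.argsort(lam)[::-1][:2]
Er, lam_r = E[:, idx], lam[idx]           # leading eigenpairs (r = rank)

v = rng.standard_normal(6)                # an innovation-like vector
coeffs = Er.T @ v

# (a) Keeping the null-space projection term (pure noise when G is rank deficient)
with_noise = (v - Er @ coeffs) + Er @ (coeffs / (1.0 + lam_r))

# (b) Projection-neglected form: the null-space term is dropped
corrected = Er @ (coeffs / (1.0 + lam_r))

# (c) Round trip through r principal component observations:
#     project, invert the diagonal (Lambda_r + I_r), project back
via_pc = Er @ ((Er.T @ v) / (lam_r + 1.0))
```

Route (c) never touches the null space of `G`, which is exactly the regularization described in the text.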

Lastly, it is clear that with the eigenpair approach, *r* should be as close to the rank of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ as possible, for the smallest retained eigenvalues become the most important, most certain directions to project along. This can pose numerical difficulties that our eigenproblem solver must manage. We will address this issue in section 3.

### c. Model-space localization

Equation (10) localizes covariance in observation-space, meaning observations are assumed to have an ascribable location, and distances are computed based on the distances between these observations. However, unlike traditional observations such as dropsondes, integrated observations such as satellite radiances (which are affected by the entire vertical structure of the atmosphere and have a horizontal footprint based on the antenna size) do not have a single location. Campbell et al. (2010) demonstrated how the use of observation-space localization can degrade the quality of the analysis in these cases, while Lei and Whitaker (2015) showed that observation-space localization produces a superior analysis for a particular satellite radiance case. Both approaches are idealized models of how covariance decays (isotropically) as a function of distance, and which covariance model is superior is likely to be case dependent. This is an area of research that requires further investigation.

As we directly find the eigenpairs of the sparse localized matrix, our method is theoretically compatible with model-space localization. To use this approach, the observation operator tangent linear $\mathbf{H}$ and adjoint $\mathbf{H}^{\mathrm T}$ are applied to the localized model covariance as

$$\mathbf{H}\left(\boldsymbol{\rho}^{\mathrm m} \circ \mathbf{P}^f\right)\mathbf{H}^{\mathrm T}, \tag{37}$$

where

$$\boldsymbol{\rho}^{\mathrm m} \circ \mathbf{P}^f = \boldsymbol{\rho}^{\mathrm m} \circ \left[\frac{1}{N_e-1}\sum_{j=1}^{N_e} \mathbf{x}'^f_j\left(\mathbf{x}'^f_j\right)^{\mathrm T}\right] \tag{38}$$

and

$$\left[\boldsymbol{\rho}^{\mathrm m}\right]_{ij} = \rho\!\left(d^{\mathrm m}_{ij}, L\right), \tag{39}$$

where $d^{\mathrm m}_{ij}$ is the distance between the location of two model states *i* and *j* with the same localization function as Eq. (11). Equation (26) is changed analogously as

$$\left(\boldsymbol{\rho}^{\mathrm m} \circ \mathbf{P}^f\right)\mathbf{H}^{\mathrm T}. \tag{40}$$

It is not necessary to form the prohibitively expensive full $N \times N$ matrix, as only the entries of Eq. (38) for pairs of model points used by the observation operators need to be computed. This can be accomplished pairwise for each observation *i* and *j* by computing the localized model points where the observation footprints overlap. However, in addition to requiring observation operator tangent linear models and adjoints, this approach requires some bookkeeping and communication to track which observation operators use which model points, and for ease of implementation in this paper, we therefore demonstrate our methodology with observation-space localization without the loss of generality for the theoretical algorithm.

### d. Comparison to other selected eigenvalue-based EnKF methods

Previous works have used eigenproblem-based methods to solve the EnKF equations based on similar principles but without, to the authors’ best knowledge, a full accounting of covariance localization and rank deficiency. Bishop et al. (2001) use the eigenvalue decomposition of the ensemble-space forward observation covariance matrix to find the solution of the inverse in Eq. (1). This is equivalent to finding the Sherman–Woodbury update, as shown in Tippett et al. (2003). However, this approach cannot be used in the presence of covariance localization, as Eq. (10) includes the Schur product and increases the rank of the matrix, requiring either computation of all nonsingular eigenvalues of the localized matrix or the columns of the square root of the localization matrix, as in Bishop and Hodyss (2009) using the so-called modulation product. Posselt and Bishop (2012) showed how to extend the findings of Bishop et al. (2001) in a computationally efficient way to the case where there are more ensemble members than observations, making the matrix $\mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ full rank; we extend this analysis without this assumption. Hamill and Snyder (2002) used eigenvectors of the observation covariance matrix to test the impact of a single radiosonde on the mean forecast, assuming the matrix is full rank. Our paper likewise extends this to multiple (possibly rank deficient) observations with covariance localization, updating both the mean and ensemble. Furrer and Bengtsson (2007) examined the issue of covariance localization in the ensemble Kalman filter and found several approaches to estimate “taper matrices,” which approximate the localized covariance matrix yet are not necessarily positive semidefinite. In this work we instead directly consider the eigenpairs of the localized covariance matrix.

## 3. Direct solution of square root observation filter equations: Numerical implementation

In the previous section we developed a theoretical method for directly solving the ESRF Eq. (1) by computing the eigenpairs of the localized forward observation covariance matrix such that even a rank-deficient matrix can be utilized. Finding eigenpairs of a matrix is a prevalent problem across many different research fields (e.g., Borsanyi et al. 2015; Keçeli et al. 2016) as well as industry (e.g., Bryan and Leise 2006). Therefore, we are able to benefit from the massive amount of effort that has been expended in this area by choosing an appropriate library.

### Finding the eigenpairs of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$

In the previous section we showed that finding the eigenvectors and eigenvalues of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ was sufficient to solve the most difficult part of Eq. (1)—the matrix inverse in $\tilde{\mathbf{K}}$, which involves a square root of the localized covariance, which in turn involves a Schur product.

The localized matrix $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ is a sparse matrix. In HEDAS, where frequent cycling is used, the number of assimilated observations in a single cycle is small enough that these matrices can be represented directly in memory spread across multiple machines. Even a full double-precision matrix of this size would correspond to approximately 150 GB of memory, and with 32 GB of RAM per node, the matrix can be represented in memory distributed across approximately five or six computational nodes. Furthermore, because of covariance localization, these matrices are sparse enough that the covariance matrix can fit in memory on just one or at most two nodes. However, as the number of observations increases, it may become necessary to represent the matrix only in terms of its action upon some vector $\mathbf{v}$, that is, $\left(\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}\right)\mathbf{v}$. Furthermore, as the data assimilation problem continues to grow, we benefit enormously from the experience of the computational science community with different methods to store and solve sparse linear algebra and eigenvalue problems.

We wish to use a flexible solver that is capable of handling multiple types of problems (e.g., if we required using a matrix action at some point in the future) and of taking advantage of different linear algebra techniques (such as direct and iterative matrix solvers). The Scalable Library for Eigenvalue Problem Computations (SLEPc; Hernández et al. 2003, 2005; Campos and Roman 2012; Roman et al. 2016), which is built upon the Portable, Extensible Toolkit for Scientific Computation (PETSc; Balay et al. 1997, 2016a,b), meets these requirements. PETSc recently celebrated its 20th anniversary (http://www.mcs.anl.gov/petsc/petsc-20.html), and SLEPc/PETSc has been used successfully by Keçeli et al. (2016) on over 266 000 cores to extract 330 000 eigenpairs in approximately 190 s. In short, it is a mature, flexible, and scalable library with developmental support and a wide-ranging user base.
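On a single node, the role SLEPc plays here can be imitated with SciPy's ARPACK interface, which likewise extracts a subset of eigenpairs from a sparse symmetric matrix. The sketch below is our own stand-in, not the paper's SLEPc implementation, and the hard distance cutoff is only a crude substitute for a Gaspari–Cohn taper:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(4)
n, n_ens = 200, 10
Hx = rng.standard_normal((n, n_ens))
Hx -= Hx.mean(axis=1, keepdims=True)
G = Hx @ Hx.T / (n_ens - 1)                      # low-rank ensemble covariance
# Hard cutoff localization mask (illustrative only; not PSD-preserving)
d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
G_loc = csr_matrix(np.where(d <= 10, G, 0.0))    # sparse localized matrix

r = 20
lam, E = eigsh(G_loc, k=r, which='LA')           # r largest (algebraic) eigenpairs

# Dense reference spectrum for verification
dense = np.sort(np.linalg.eigvalsh(G_loc.toarray()))[-r:]
```

SLEPc's shift–invert Lanczos solver performs the analogous computation in a distributed-memory setting, which is what makes the approach scale.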

We now describe our parallel numerical implementation built upon this framework. We first must divide the computational domain among processing elements (PEs). We wish to allow all observation operators to be computed in parallel, so each PE must have all the model points necessary to compute the operator. With this in mind, while additional observations remain, the least-utilized PE is chosen, and the observation and all model points necessary to compute it are assigned to that PE (duplicating model indexes if necessary; however, only one PE is assigned as the owner of a given model point). Once all observations have been distributed, the remaining model indexes not directly used in the computation of observation operators are divided among the remaining PEs. An attempt is made to keep vertical columns of the domain together on one PE to improve performance. For any given model point, then, multiple PEs may load the data but only one will be marked as the owner; for each observation, only one PE will be responsible for computing and transferring that observation.

For a given PE, the selected indices across all ensemble members are loaded from disk using Parallel netCDF (Li et al. 2003), and the forward observation calculations and quality control are performed in parallel and then broadcast to all PEs. We now divide the matrix $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ into groups by assigning a starting and ending row for which a given PE will be responsible.

For our eigenvalue problem, we use the multiple shift–invert Lanczos method (Bai et al. 2000; Aktulga et al. 2014; Keçeli et al. 2016). A two-step process is used. The first stage divides the eigenvalue spectrum, in parallel, into regions with approximately equal numbers of eigenpairs. The second step computes all eigenvalues and eigenvectors, in parallel, within each region from the previous step. Each region is assigned multiple processors, allowing for two levels of parallelism.

Let $n_r$ be the number of regions and $n_p$ be the number of PEs per region. Therefore,

$$n_r\,n_p = N_{\mathrm{PE}}, \tag{41}$$

where $N_{\mathrm{PE}}$ is the total number of PEs.

To compute the matrix spectrum of a matrix $\mathbf{A}$, the matrix inertia theorem is utilized, which states that the number of eigenvalues in the region $[\sigma_1, \sigma_2)$, for $\sigma_1 < \sigma_2$, can be found by decomposing [in our case, in a parallel and distributed way by calling from PETSc the Multifrontal Massively Parallel Sparse Direct Solver (MUMPS); Amestoy et al. 2000] both

$$\mathbf{A} - \sigma_1\mathbf{I} = \mathbf{L}_1\mathbf{D}_1\mathbf{L}_1^{\mathrm T} \tag{42}$$

and

$$\mathbf{A} - \sigma_2\mathbf{I} = \mathbf{L}_2\mathbf{D}_2\mathbf{L}_2^{\mathrm T}, \tag{43}$$

where $\mathbf{L}_1$ and $\mathbf{L}_2$ are lower triangular and $\mathbf{D}_1$ and $\mathbf{D}_2$ are block diagonal. The difference between the number of negative entries of $\mathbf{D}_1$ and $\mathbf{D}_2$ in Eqs. (42) and (43) is the number of eigenvalues in the region $[\sigma_1, \sigma_2)$. As in Aktulga et al. (2014), an evenly spaced first guess of region divisions is given to each of the groups, and the matrix inertia is computed at each of these points. The resulting spectrum is then interpolated so that each region has an approximately equal number of eigenpairs for efficient division of effort. The matrix inertia at these divisions is computed once again to ensure the eigenvalue spectrum is exact. As mentioned in Aktulga et al. (2014), this spectrum slicing strategy is a computationally expensive step ultimately worth the cost in order to achieve efficient distribution of effort for what follows.
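The inertia-based counting in Eqs. (42) and (43) can be sketched on a single node with SciPy's LDL^T factorization (our own illustration; in the actual implementation, MUMPS performs these factorizations in parallel):

```python
import numpy as np
from scipy.linalg import ldl

def count_eigs_below(A, sigma):
    """Number of eigenvalues of the symmetric matrix A below sigma, via the
    inertia of an LDL^T factorization of A - sigma*I (Sylvester's law of inertia)."""
    _, D, _ = ldl(A - sigma * np.eye(A.shape[0]))
    # D may contain 2x2 blocks; its eigenvalues carry the inertia
    return int(np.sum(np.linalg.eigvalsh(D) < 0.0))

def count_eigs_in(A, a, b):
    """Number of eigenvalues of A in the half-open interval [a, b)."""
    return count_eigs_below(A, b) - count_eigs_below(A, a)

rng = np.random.default_rng(5)
M = rng.standard_normal((30, 30))
A = (M + M.T) / 2.0                      # symmetric test matrix
w = np.linalg.eigvalsh(A)                # dense reference spectrum
```

Each factorization costs far less than a full eigendecomposition, which is why spectrum slicing pays off when the shifts are distributed across processing regions.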

Once each processing region (with $n_p$ PEs) is assigned a portion of the eigenvalue spectrum, the shift–invert Lanczos method is used within SLEPc to compute all of the known number of eigenvalues and eigenvectors within this region (Campos and Roman 2012). The regions at the left and right endpoints of the spectrum compute the smallest and largest eigenpairs, respectively, both of which SLEPc can handle. Because it is easier to work with a nonsingular matrix, we compute the eigenpairs of $\mathbf{I} + \boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$ and subtract 1 from the eigenvalues to give those of $\boldsymbol{\rho} \circ \mathbf{H}\mathbf{P}^f\mathbf{H}^{\mathrm T}$.

After the eigenpairs have been computed, the sums in Eqs. (35) and (36) are found and then distributed to each PE. Since the retained eigenvector matrix (where $r \le N_{\mathrm{obs}}$) is $N_{\mathrm{obs}} \times r$ and the weighted sums [which again are computed as the right-hand sides of Eqs. (35) and (36)] are one per ensemble member plus one for the mean, these are combined into a single $N_{\mathrm{obs}} \times (N_e + 1)$ matrix, which is relatively small. Each processing element then computes the analysis for each model index it is responsible for to solve Eq. (1) in an embarrassingly parallel way. This is possible, as each PE has this entire matrix and the observation locations in memory, and only the local model analysis grid points are required to compute the localization and matrix product terms in Eq. (26) with observation-space localization.^{4} This is sufficient for each PE to solve the Kalman equations; that is, there is no need for any explicit communication among processors after the combined matrix has been distributed, except to write the computed analysis to disk.

## 4. Numerical results

We implemented the abovementioned algorithm, including installing and testing the SLEPc/PETSc library, in Fortran on NOAA’s Research and Development High Performance Computing Program Jet supercomputing system. We tested this with a Hurricane Edouard (2014) case with 30 ensemble members and approximately 8000 quality-controlled observations from sources including satellite retrievals and the NASA AV-6 Global Hawk 20140916GH Storm Survey mission (Zawislak et al. 2016; Rogers et al. 2016). The details of the evolution of the tropical cyclone can be found in Stewart (2014). As with other HEDAS studies, the initial conditions were initialized from the GFS ensemble 6 h before assimilation; for more information see section 1b.

We test assimilating these observations with a fixed covariance localization length *L* from Eqs. (27) and (11), set to values ranging up to 240, with $\rho$ as from Eq. (4.10) of Gaspari and Cohn (1999). The unit on *L* is the number of grid cells, each of which is approximately 2 km, on the total domain of size 232 × 451 × 60 longitude/latitude/vertical model grid points, and *L* represents the number of horizontal grid points before the correlation reaches 0. Vertically, the distance is scaled so that after 15 vertical levels, the correlation goes to 0. While we vary the horizontal *L*, the vertical length scale is fixed at 15 levels.

The eigenvalue spectrum of these different cases is shown in Fig. 1, which shows that the number of eigenvalues (i.e., the rank of ) grows larger as *L* grows smaller. For all values of *L*, there is a sharp decay in the eigenvalue spectrum after ; likewise, there are few eigenvalues larger than .
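
The dependence of rank on localization length can be reproduced in a toy setting: Schur-multiplying a low-rank sample covariance from a small ensemble by a broad taper leaves the numerical rank near the ensemble size, while a short taper drives the product toward a full-rank, near-diagonal matrix. The sketch below is a hypothetical illustration, using a Gaussian taper in place of the Gaspari–Cohn function and random data in place of the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 40, 6                           # observations, ensemble members
x = np.arange(p, dtype=float)          # 1-D observation locations, unit spacing

E = rng.standard_normal((p, k))
E -= E.mean(axis=1, keepdims=True)     # centred ensemble: rank(S) <= k - 1
S = E @ E.T / (k - 1)                  # low-rank sample covariance

def numerical_rank(L, tol=1e-6):
    d = np.abs(x[:, None] - x[None, :])
    rho = np.exp(-0.5 * (d / L) ** 2)  # Gaussian taper in place of Gaspari-Cohn
    lam = np.linalg.eigvalsh(rho * S)  # Schur-localized covariance
    return int((lam > tol * lam.max()).sum())

assert numerical_rank(1e6) <= k - 1    # broad taper: rank limited by ensemble size
assert numerical_rank(0.2) == p        # short taper: effectively full rank
```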

The prior forecast and analysis means based on Eqs. (35) and (36) are shown in Fig. 2 for temperature in the midlevels (level 35 of 60, where the first level is the surface and level 60 corresponds to 50 hPa). The prior exhibits a symmetric warm core, whereas in this case the analyzed vortex is noticeably less symmetric. That the vortex may be unbalanced in the analysis is a common problem in data assimilation (Nehrkorn et al. 2015), and there are multiple potential remedies (Beezley and Mandel 2008; Kalnay and Yang 2010; Nehrkorn et al. 2015). The approach currently used in HEDAS is to distribute observations among frequently spaced assimilation subcycles using a storm-relative approach (Aksoy 2013), allowing the model dynamics to ease toward the observations over multiple assimilations. The results shown here are from the first and only such cycle and therefore still include the deleterious effects of feature misalignment. However, as our intent is primarily to demonstrate the feasibility of this implementation on a relatively large set of observations, we do not account for these issues here.

Figure 3 shows the same analysis along with the *rank-deficiency projection error* of the potentially rank-deficient observation covariance matrix. We define this projection error as the effect of using Eqs. (22) and (25), which amounts to possibly erroneously assuming is full rank, instead of the corresponding corrected versions, Eqs. (35) and (36), respectively, which address this rank deficiency. The rank-deficiency projection error grows as *L* (the correlation length scale) increases, since the rank of the matrix decreases and each observation influences a larger part of the domain. Figure 4 shows the absolute difference between the rank-deficiency-neglected analysis and the corrected analysis of the 10-m wind speed, first arising from differences in observation ordering (where the difference is zero to double-precision machine ε) and then as a function of correlation length scale. The spatial features of this projection error also suggest that the eigenpairs involved (in this case the extraneous, linearly dependent directions) exhibit complex spatial structure.
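
The effect of neglecting rank deficiency can be isolated in a toy calculation: evaluating an inverse-type matrix function through all computed eigenpairs lets eigenvalues at machine zero dominate the result, while truncating to the numerical range regularizes the problem. The following Python sketch is hypothetical, using an inverse square root as a stand-in for the filter's matrix function rather than Eqs. (35) and (36) themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 30, 10
A = rng.standard_normal((p, r))
C = A @ A.T                          # rank-deficient SPSD matrix, as for long L

lam, V = np.linalg.eigh(C)
b = rng.standard_normal(p)

def inv_sqrt_apply(cutoff):
    """Evaluate C^(-1/2) b via eigenpairs, keeping eigenvalues above cutoff."""
    keep = lam > cutoff
    return V[:, keep] @ ((V[:, keep].T @ b) / np.sqrt(lam[keep]))

x_trunc = inv_sqrt_apply(1e-8 * lam.max())  # truncated to the numerical range
x_naive = inv_sqrt_apply(0.0)               # treats C as if it were full rank
# The naive evaluation is dominated by noise from eigenvalues near machine zero.
blow_up = np.linalg.norm(x_naive - x_trunc) / np.linalg.norm(x_trunc)
```

The relative difference `blow_up` is enormous here because the 20 numerically zero eigenvalues are inverted; discarding them is exactly the regularization described in the text.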

Figure 5 shows the root-mean-square rank-deficiency projection error (at all points over the entire domain) and maximum projection error as a function of model level (the first level is the surface) for temperature, water vapor, and condensed water mass for our Edouard case for the different *L* values. For the longest *L*, the RMSE can reach 0.55 K, 0.65 g kg^{−1}, and 0.25 g kg^{−1} for temperature, water vapor, and condensed water mass, respectively, in the middle to upper troposphere. The maximum projection error arising from the projection onto degenerate directions can reach values on the order of approximately 5 times the RMSE. In short, the projection error can be surprisingly large and likely physically meaningful, especially with larger values of *L*.

The eigenvalue computations took between 25 s () and 120 s () with 384 PEs; in other words, fewer eigenvalues actually take longer to compute. This is because the matrix becomes denser as *L* grows, and working with the denser matrix takes longer. In all the cases tested, the error of was less than for all *i*.
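
A natural accuracy check for computed eigenpairs is the residual norm of each pair, analogous to the convergence criterion SLEPc exposes for its iterative solves. The same check can be written in a few lines (schematic, with a dense symmetric matrix in place of the paper's sparse localized covariance):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 40
A = rng.standard_normal((p, p))
C = (A + A.T) / 2                      # symmetric test matrix

lam, V = np.linalg.eigh(C)
# Residual norms ||C v_i - lambda_i v_i|| quantify each eigenpair's accuracy.
res = np.linalg.norm(C @ V - V * lam, axis=0)
assert res.max() < 1e-10 * np.abs(lam).max()
```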

We conducted our tests on the NOAA Jet supercomputing system (tjet processing queue), where each node has 12 cores with a 2.66-GHz CPU and 2 GB of RAM, connected via quad data rate (QDR) InfiniBand. Tests determined that (six PEs per region) performed best. The peak performance per node is 127.7 gigaflops, while in the SLEPc portion of the code we achieved approximately 440 gigaflops with 384 PEs with this setup, about 11% of the theoretical peak. Figure 6 shows the increase in computation speed as a function of the number of PEs with a fixed . As shown, the time to distribute the observations and model indices across the PEs (Fig. 6a) is fixed, as this step is currently done in serial. Parallel netCDF read performance (Fig. 6b) improved as the number of PEs increased, but parallel writes (Fig. 6e) showed a marked decrease in performance as the PEs competed with one another for access to the disk. The eigenproblem computation (Fig. 6c) and filtering time (Fig. 6d) [i.e., the solution of Eq. (1) once has been distributed to all PEs] speed up faster than linearly because the smaller per-PE matrix blocks fit in cache memory. Overall (Fig. 6f), the problem is therefore input/output (I/O) bound, and improvements to the parallel file system would be necessary for additional performance gains. This is unsurprising, as reading and writing all model variables is by far the most expensive operation in data assimilation. An additional order of magnitude of observations beyond our current test would likely necessitate additional PEs, but for the current problem size (), 384 PEs achieve the maximum parallel performance gain.

From these results and other results with the SLEPc library (e.g., Keçeli et al. 2016), we have every reason to believe this parallel direct method will be able to scale observations—or more given additional computational resources, especially with further improvements to the parallel file system.

## 5. Conclusions

In this paper we describe an algorithm to compute, in parallel, a direct solution of the ensemble square root Kalman filter (ESRF) equations in a way that does not depend on the ordering of observations. The previous approach to solving the ESRF equations, based on serial assimilation of observations, becomes suboptimal when covariance localization is introduced, as the covariance localization operation is not linear and does not commute with serial assimilation in the ESRF implementations described to date. While a consistent serial non–square root EnKF was previously presented in the CHEF implementation of Bishop et al. (2015) with the use of perturbed observations, this work describes an equivalently consistent parallel direct ESRF implementation without the use of perturbed observations. To maintain consistency, we solve the ensemble square root Kalman filter equations directly to regain the optimal properties of the filter. We demonstrate how this can be done in a practical manner by using the theory of matrix functions to show that for the observation covariance matrix (which can include covariance localization), after making the observation error covariance white by the transformation , the eigenvectors of are the same as and the eigenvalues are the same as those found by “scalarizing” the function *f* and substituting , that is, . We also demonstrate an error bound on the approximation. When using the first implementation described, which does not explicitly account for the null space of the forward observation covariance, the solution contains a “projection error” when this matrix is rank deficient. We demonstrate how this projection error can be eliminated by conceptually assimilating principal component observations; in fact, this is exactly the same as removing the projection term from the full solution. The rank-deficiency projection error increases with localization length but can be avoided entirely by the approach we describe.
We demonstrate a successful parallel and scalable numerical implementation of this algorithm and demonstrate how the assimilation behaves as a function of covariance localization length.
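
The "scalarizing" construction referred to above is the standard spectral definition of a matrix function: for a symmetric matrix with eigendecomposition C = V Λ Vᵀ, one has f(C) = V f(Λ) Vᵀ, where f is applied to each eigenvalue. A minimal check in Python, using the square root as f:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 25
A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)            # symmetric positive definite test matrix

lam, V = np.linalg.eigh(C)

def matfunc(f):
    """Evaluate f(C) by scalarizing f over the eigenvalues: V f(Lambda) V^T."""
    return (V * f(lam)) @ V.T

R = matfunc(np.sqrt)                   # principal matrix square root
assert np.allclose(R @ R, C)           # squaring recovers C (to rounding)
```

Any scalar function applied this way (an inverse square root, for example) yields the corresponding matrix function, which is what allows the eigenpairs to be computed once and reused.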

As first noted in Bishop et al. (2001) for the ensemble transform Kalman filter (ETKF), the actual assimilation method requires only the matrix product of the matrices considered here. This fact leads to two realizations. First, it may be feasible to construct a method that avoids computing all of the eigenpairs and instead uses matrix functions to evaluate the required matrix products much faster by considering only the subspace spanned by the most significant directions of the matrix product rather than the entire eigenspace. Such an approach would be far superior to computing the full eigenspace. Second, the question of the null space of the forward observation covariance matrix, while theoretically interesting, may not be a practical issue for most EnKF implementations unless the right-hand product terms have a nonzero projection onto the null space of this matrix. Additional research is necessary to determine whether and when this “rank projection” error occurs when using localized covariance matrices.

The approach described in this paper is more complex than the serial ESRF approximation, but once integrated with an appropriate library for solving a large, sparse, symmetric, positive semidefinite eigenproblem (taking advantage of the pervasive nature of the eigenproblem across disciplines), the approach is not conceptually difficult and becomes highly amenable to a scalable parallel implementation. Such libraries also incorporate many computational best practices and bring the solution of the ESRF equation into line with the state of the art in numerical linear algebra. In this approach, because observations are assimilated all at once, unlike the approach given in Anderson and Collins (2007), the observations do not need to be treated as augmented state variables for the parallel implementation. Once the eigenproblem has been solved, when used with observation-space localization as presented here, the remaining analysis becomes embarrassingly parallel, as there is no need to communicate across processing elements. The parallel direct ensemble square root Kalman filter approach described here is therefore a new class of unperturbed-observation direct ensemble square root Kalman filters that is amenable to large-scale parallel implementations.

## Acknowledgments

The authors acknowledge the vital contributions of two anonymous reviewers. This research was partially funded by the NOAA Hurricane Forecast Improvement Project Award NA14NWS4680022. This research was carried out in part under the auspices of CIMAS, a joint institute of the University of Miami and NOAA (Cooperative Agreement NA67RJ0149). The authors acknowledge the NOAA Research and Development High Performance Computing Program for providing computing and storage resources that have contributed to the research results reported within this paper (http://rdhpcs.noaa.gov). We gratefully acknowledge the extensive technical support of Dr. José Roman.

### APPENDIX

## REFERENCES

Bai, Z., J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Eds., 2000: *Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide*. Software, Environments and Tools, Vol. 11, SIAM, 410 pp.

Balay, S., W. D. Gropp, L. C. McInnes, and B. F. Smith, 1997: Efficient management of parallelism in object-oriented numerical software libraries. *Modern Software Tools for Scientific Computing*, Springer, 163–202.

Hernandez, V., J. E. Roman, and V. Vidal, 2003: SLEPc: Scalable Library for Eigenvalue Problem Computations. *High Performance Computing for Computational Science—VECPAR 2002*, Lecture Notes in Computer Science, Vol. 2565, Springer, 377–391.

Higham, N. J., 2008: *Functions of Matrices: Theory and Computation*. Other Titles in Applied Mathematics, SIAM, 413 pp.

Kalnay, E., 2003: *Atmospheric Modeling, Data Assimilation and Predictability*. Cambridge University Press, 341 pp.

*SC2003: Igniting Innovation; Proceedings of the ACM/IEEE SC2003 Conference*, 39.

## Footnotes

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

^{1}

In this paper by we mean rather than the alternative definition .

^{2}

While a similar normalization technique is used in Bishop et al. (2001), our main motivation is to treat the scaling term as a preprocessing step in order to avoid carrying it through the equations. This changes the original assimilation problem in the presence of covariance localization [see Eqs. (10) and (26) below] so that the scaled observations and operators are localized and assimilated. This approach is otherwise equivalent to the nonscaled problem but is easier to solve.

^{3}

Note that to ensure positive semidefiniteness of , the distance defined here must follow the rules of a metric; namely, the distance is nonnegative, the distance is zero if and only if the points are the same, the distance between two points is symmetric, and the triangle inequality holds among any three points. As a practical note, the mismatch of different length scales may violate the third or fourth requirement.