Nonlinear Least Squares En4DVar to 4DEnVar Methods for Data Assimilation: Formulation, Analysis, and Preliminary Evaluation

The En4DVar method is designed to incorporate the flow-dependent statistical covariance information of the EnKF into the traditional 4DVar method. However, En4DVar is still hampered by its strong dependence on the adjoint model of the underlying forecast model and by its complexity, maintenance requirements, and the high cost of computer implementation and simulation. The primary goal of this paper is to propose an alternative approach that overcomes the main difficulty of the En4DVar method, namely its use of adjoint models. The proposed approach, the nonlinear least squares En4DVar (NLS-En4DVar) method, begins by rewriting the standard En4DVar formulation as a nonlinear least squares (NLS) problem, which is then solved by a Gauss-Newton iterative method. To reduce the computational and implementation complexity of the proposed NLS-En4DVar method, a few variants of the new method are proposed; these modifications make the method cheaper and easier to use than the full NLS-En4DVar method at the expense of reduced accuracy. Furthermore, an improved iterative method based on a comprehensive analysis of the NLSi-En4DVar family of methods is also proposed. These NLSi-En4DVar methods provide more flexible choices of computational capability for the broader and more realistic data assimilation problems arising from various applications. The pros and cons of the proposed NLSi-En4DVar family of methods are further examined in the paper, and their relationships and performance are evaluated by several sets of numerical experiments based on the Lorenz-96 model and the Advanced Research WRF (ARW) Model, respectively.


Introduction
Data assimilation for numerical weather prediction (NWP) has experienced explosive growth and development since the ensemble Kalman filter (EnKF; Evensen 1994; Houtekamer et al. 2014) and four-dimensional variational data assimilation (4DVar; Rabier et al. 2000) techniques were successively introduced as the two major competing methods for initializing NWP. It is their competition (Kalnay et al. 2007; Lorenc 2003) and their interplay (Gustafsson 2007) that have fueled the rapid growth and development in this area of research over the past 20 years. On one hand, successful applications of 4DVar at several major operational NWP centers (e.g., ECMWF, and those of France, the United Kingdom, Japan, and China) demonstrate that its development has reached an advanced and mature stage. A consensus viewpoint in the NWP community is that 4DVar has the advantages of producing highly accurate predictions and of being able to assimilate asynchronous observations simultaneously. On the other hand, 4DVar has been hampered by its strong dependence on the adjoint model of the forecast model, its complexity, its maintenance requirements during operations, and the high costs associated with programming and computer simulations. Moreover, the adjoint models are often very difficult to obtain when the model physics involves substantial ''on-off'' processes (Xu 1996; Zou et al. 1997; Steward et al. 2012). These limitations call for improvements and for the development of more efficient methods such as EnKF or some nonsmooth optimization approaches (e.g., Karmitsa 2016). In contrast, EnKF has become increasingly competitive with 4DVar owing to its capability of providing flow-dependent error estimates from the statistics of the ensemble samples. However, the performance of EnKF can be compromised by sampling errors caused by limited sample sizes, which has indeed been a difficult issue throughout the development of EnKF.
Recently, much effort has been devoted to developing hybrid methods based on 4DVar and EnKF, which could ideally inherit the advantages of both methods while avoiding their deficiencies (Gustafsson 2007; Kalnay et al. 2007; Bowler et al. 2017a,b; Lorenc 2017; Lorenc et al. 2017; Bannister 2017; Amezcua et al. 2017; Goodliff et al. 2017). For example, on the EnKF side, Hunt et al. (2004) extended EnKF to four dimensions to allow the use of asynchronous observations, which introduced the 4DEnKF method. The 4DEnKF approach can assimilate observations at selected time points just as 4DVar does but without iteration. Evensen and Van Leeuwen (2000) proposed the ensemble Kalman smoother to achieve the same purpose. On the 4DVar side, the ''hybrid'' (e.g., Clayton et al. 2013), the ''En4DVar'' (e.g., Zhang et al. 2009), and the ''4DEnVar'' assimilation methods (Qiu et al. 2007; Liu et al. 2008; Tian et al. 2008; Wang et al. 2010; Tian et al. 2011; Tian and Xie 2012; Tian and Feng 2015) were successively proposed to incorporate the statistical flow-dependent strength of EnKF into 4DVar. It was noted by Lorenc (2013) that the hybrid 4DVar uses a combination of climatological and ensemble covariances, while En4DVar utilizes the ensemble background covariance provided by a parallel EnKF implementation. It should be noted that the adjoint model is often used to carry out the minimization procedure in both the hybrid 4DVar and En4DVar, whereas the 4DEnVar approach utilizes the simulated observation perturbations (OPs) to approximate the action of the linearized tangent operator and hence avoids the adjoint models. Indeed, some variants of 4DEnVar without iteration are similar to 4DEnKF (Tian et al. 2011; Hunt et al. 2004). This similarity further underscores the necessity of integrating 4DVar and EnKF.
In comparison, the competitiveness of En4DVar lies in its more accurate description of the tangent linear model (when the linear approximation is valid), while the advantage of 4DEnVar is its ease of programming. A natural question is whether we can achieve a balance between 4DVar and EnKF in terms of performance and accuracy. Such a balance, it seems, can only be achieved through a comprehensive analysis of all the factors involved, including assimilation accuracy, programming difficulty, and computational complexity.
The primary objective of this paper is to address the above question. We begin by reviewing the standard En4DVar formulation and then present its reformulation as a nonlinear least squares (NLS) problem. A Gauss-Newton iterative algorithm is then proposed to solve the reformulated NLS problem. Since the exact implementation of the Gauss-Newton algorithm is computationally expensive, we propose various levels of simplification of the full Gauss-Newton algorithm with decreasing computational complexity. Among these simplified algorithms, we recover the NLS-4DVar algorithm proposed by Tian and Feng (2015). If only one iteration is retained in that algorithm, it further reduces to POD-4DVar (Tian et al. 2008; Tian et al. 2011), which is a typical 4DEnVar method without iteration.
The remainder of this paper is organized as follows. In section 2, we first describe the aforementioned NLS reformulation of the standard En4DVar and the Gauss-Newton algorithm. We then present several simplified versions of the full Gauss-Newton algorithm and examine their pros and cons. In section 3, we carry out several numerical experiments using the Lorenz-96 model to evaluate the performance of all the proposed algorithms, with a focus on their assimilation accuracy and computational complexity. We then perform a group of evaluation experiments using the Advanced Research WRF (ARW) Model to further explore the potential applicability of three members of the NLSi-En4DVar family of methods to real atmosphere and/or ocean models. Finally, we finish the paper with a summary and a few concluding remarks in section 4.

Methodology
The incremental form of 4DVar is defined as seeking the analysis of the initial conditions (ICs), x^a_0 (at the initial time t_0), such that x^a_0 minimizes the following cost function:

J(x_0) = 1/2 (x_0 − x_b)^T B^{−1} (x_0 − x_b) + 1/2 Σ_{k=0}^{S} [H_k(M_{t_0→t_k}(x_0)) − y_obs,k]^T R_k^{−1} [H_k(M_{t_0→t_k}(x_0)) − y_obs,k].   (1a)

It is easy to check that Eq. (1a) can be rewritten as

J(x'_0) = 1/2 (x'_0)^T B^{−1} x'_0 + 1/2 Σ_{k=0}^{S} [y'_k(x'_0) − y'_obs,k]^T R_k^{−1} [y'_k(x'_0) − y'_obs,k],   (1b)

where x'_0 ∈ R^{m_x} (m_x is the dimension of the state vector) is the perturbation of the background field x_b at t_0 and

y'_k(x'_0) = L_k(x_b + x'_0) − L_k(x_b),  y'_obs,k = y_obs,k − L_k(x_b),  L_k(x) ≡ H_k[M_{t_0→t_k}(x)].

Here, the superscript T stands for the matrix transpose, y_obs,k ∈ R^{m_y,k} is the observation at time instance t_k, m_y,k is the dimension of the observational vector y_obs,k, the subscript b denotes the background value, S + 1 is the total number of observation time points in the assimilation window, M_{t_0→t_k} is the forecast model integrated from t_0 to t_k, H_k is the observation operator, and B and R_k are the background and observation error covariance matrices, respectively. All the symbols used in this paper follow the general conventions and are described in Table 1. In En4DVar, the background error covariance matrix B is approximated by the ensemble covariance matrix B_e as

B_e = P_x (P_x)^T / (N − 1),

where P_x = (x'_1, x'_2, . . . , x'_N) is the matrix of prepared initial perturbations (IPs) and N is the ensemble size.
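As a concrete illustration of this approximation, the following minimal sketch (not the paper's code) builds B_e from a toy ensemble; the random matrix here merely stands in for real forecast perturbations.

```python
import numpy as np

def ensemble_covariance(X):
    """Approximate B by the ensemble covariance B_e = P_x (P_x)^T / (N - 1),
    where the columns of P_x are the member perturbations about the ensemble mean."""
    N = X.shape[1]                              # ensemble size
    Px = X - X.mean(axis=1, keepdims=True)      # initial perturbation (IP) matrix
    return Px @ Px.T / (N - 1)

# Tiny illustration: a 3-variable state with 5 ensemble members.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Be = ensemble_covariance(X)
```

By construction B_e is symmetric positive semidefinite with rank at most N − 1, which is exactly why localization (next paragraph) is needed in practice.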
To filter out the noise resulting from undersampling (Houtekamer and Mitchell 2001), a decomposed correlation function operator (Liu et al. 2009; Buehner 2005) is often applied to modify the IP matrix P_x as follows:

P_x,r = (ρ_x ∘ P_x,1*, ρ_x ∘ P_x,2*, . . . , ρ_x ∘ P_x,N*) ≡ P_x ⟨∘⟩ ρ_x,   (6a)

and thus

B_e,r = P_x,r (P_x,r)^T / (N − 1) = ρ ∘ B_e,   (6b)

where P_x,r ∈ R^{m_x×(r×N)} [r is the number of eigenvectors chosen; see Liu et al. (2009)] is the localized IP matrix, P_x,j* (j = 1, . . . , N) is an m_x × r matrix whose every column is the jth column of P_x, ρ_x ∈ R^{m_x×r}, and ρ = ρ_x (ρ_x)^T. Here, ρ ∈ R^{m_x×m_x} is the spatial correlation matrix (Zhang and Tian 2017, manuscript submitted to J. Geophys. Res. Atmos., hereafter ZT), and B ∘ C stands for the Schur product of the matrices B and C, that is, the matrix whose (i, j) entry is b_i,j × c_i,j. In real data assimilation problems, the direct decomposition of ρ (= ρ_x (ρ_x)^T) may not be feasible because of its huge size; ZT proposed an efficient local correlation matrix decomposition approach to compute the decomposition ρ = ρ_x (ρ_x)^T. For ease of the subsequent presentation, we use the operator ⟨∘⟩ defined in Eq. (6a).
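The identity B_e,r = ρ ∘ B_e can be checked numerically. In the sketch below, ρ_x is a random stand-in for the EOF-decomposed correlation factor of ZT (a real application would use an actual correlation decomposition), and the columns of P_x,r are grouped by eigenvector rather than by member, which is immaterial for the resulting covariance:

```python
import numpy as np

m, N, r = 6, 4, 3
rng = np.random.default_rng(1)
Px = rng.standard_normal((m, N))        # IP matrix
rho_x = rng.standard_normal((m, r))     # stand-in factor; in practice rho_x comes
rho = rho_x @ rho_x.T                   # from an EOF decomposition, rho = rho_x rho_x^T

# Expanded, localized IP matrix P_{x,r}: every IP column is Schur-multiplied
# by every column of rho_x, giving an m x (r * N) matrix.
Pxr = np.hstack([rho_x[:, [j]] * Px for j in range(r)])

Be = Px @ Px.T / (N - 1)
Ber = Pxr @ Pxr.T / (N - 1)             # localized covariance B_{e,r}
```

Expanding P_x,r (P_x,r)^T entrywise shows each (a, b) entry factors into ρ(a, b) times (P_x P_x^T)(a, b), which is the Schur-product identity of Eq. (6b).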
To avoid the direct computation of the inverse of matrix B_e,r, we further express the analysis solution x'_a as a linear combination of the columns of P_x,r,

x'_a = P_x,r β.   (7)

Substituting Eqs. (6) and (7) into Eq. (1) and expressing the cost function in terms of the new control variable β yields

J(β) = (N − 1)/2 β^T β + 1/2 Σ_{k=0}^{S} [y'_k(P_x,r β) − y'_obs,k]^T R_k^{−1} [y'_k(P_x,r β) − y'_obs,k].   (8)

Obviously, because of the nonlinearity, the solution to Eq. (8) can be computed iteratively by performing evaluations of the cost function Eq. (8) and its gradient

∇J(β) = (N − 1)β + Σ_{k=0}^{S} (H'_k M'_k P_x,r)^T R_k^{−1} [y'_k(P_x,r β) − y'_obs,k],   (9)

using a suitable descent algorithm [e.g., the limited-memory quasi-Newton (L-BFGS) method; Liu and Nocedal (1989)]. Here, M'_k and H'_k are the tangent linear (TL) models of M_{t_0→t_k} and H_k, respectively, and their respective adjoint models are (M'_k)^T and (H'_k)^T. It is not hard to see that the adjoint models (M'_k)^T and (H'_k)^T are indispensable to the En4DVar problem given by Eqs. (8) and (9). To circumvent the use of the adjoint models, we follow Tian and Feng (2015) by first rewriting Eq. (8) as a nonlinear least squares problem (Dennis and Schnabel 1996), which minimizes the following quadratic cost function:

Q(β) = 1/2 q(β)^T q(β),   (10)

where

q(β) = [ √(N − 1) β; R_{1,0}^{−1/2} (y'_0(P_x,r β) − y'_obs,0); . . . ; R_{1,S}^{−1/2} (y'_S(P_x,r β) − y'_obs,S) ]   (11)

and (R_{1,k}^{1/2})(R_{1,k}^{1/2})^T = R_k. Consequently, the first-derivative matrix (or Jacobian matrix) J_ac Q(β) of q(β) can be computed as follows:

J_ac Q(β) = [ √(N − 1) I; R_{1,0}^{−1/2} H'_0 M'_0 P_x,r; . . . ; R_{1,S}^{−1/2} H'_S M'_S P_x,r ],   (12)

TABLE 1. Summary of key notation used in this article, where m_x is the size of the state space, m_y,k is the number of observations at t_k, S + 1 is the number of time steps, r is the number of eigenvectors chosen in the EOF decomposition of the correlation matrix ρ, and N is the number of ensemble members.

where I denotes the (N × r) × (N × r) identity matrix. Therefore, the Gauss-Newton iteration for the nonlinear least squares problem Eq. (10) is defined by (Dennis and Schnabel 1996)

β_i = β_{i−1} − [J_ac Q(β_{i−1})^T J_ac Q(β_{i−1})]^{−1} J_ac Q(β_{i−1})^T q(β_{i−1})   (13)

for i = 1, . . . , I_max, where I_max is the maximum iteration number. Substituting Eqs. (11) and (12) into Eq. (13), we obtain

β_i = β_{i−1} + [Σ_{k=0}^{S} (H'_k M'^{,i−1}_k P_x,r)^T R_k^{−1} (H'_k M'^{,i−1}_k P_x,r) + (N − 1)I]^{−1} × {Σ_{k=0}^{S} (H'_k M'^{,i−1}_k P_x,r)^T R_k^{−1} [y'_obs,k − y'_k(P_x,r β_{i−1})] − (N − 1)β_{i−1}}   (14a)

and thus

[Σ_{k=0}^{S} (H'_k M'^{,i−1}_k P_x,r)^T R_k^{−1} (H'_k M'^{,i−1}_k P_x,r) + (N − 1)I] Δβ_i = Σ_{k=0}^{S} (H'_k M'^{,i−1}_k P_x,r)^T R_k^{−1} [y'_obs,k − y'_k(P_x,r β_{i−1})] − (N − 1)β_{i−1},   (14b)

where Δβ_i = β_i − β_{i−1}. Consequently, we arrive at β_i = β_{i−1} + Δβ_i and then x'^{,i}_a = P_x,r β_i by solving the linear system Eq. (14b) using the conjugate gradient (CG) method or a preconditioned CG method.
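As an illustration of the machinery behind Eqs. (13)-(14), the sketch below pairs a Gauss-Newton outer iteration with a hand-rolled CG solve of the normal equations. This is a toy one-parameter exponential fit, not the assimilation system itself; the functions `q` and `jac` stand in for q(β) and J_ac Q(β).

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-12, max_iter=200):
    """Solve the symmetric positive definite system A x = b iteratively,
    in the spirit of the CG solve applied to Eq. (14b)."""
    x = np.zeros_like(b)
    res = b.copy()
    p = res.copy()
    rs = res @ res
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        res = res - alpha * Ap
        rs_new = res @ res
        if np.sqrt(rs_new) < tol:
            break
        p = res + (rs_new / rs) * p
        rs = rs_new
    return x

def gauss_newton(q, jac, beta0, max_iter=10, tol=1e-10):
    """Gauss-Newton for min ||q(beta)||^2: at each step solve the normal
    equations (J^T J) dbeta = -J^T q with CG [cf. Eqs. (13)-(14)]."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r, J = q(beta), jac(beta)
        dbeta = conjugate_gradient(J.T @ J, -J.T @ r)
        beta = beta + dbeta
        if np.linalg.norm(dbeta) < tol:
            break
    return beta

# Toy zero-residual problem: recover a in y = exp(a t) from exact samples.
t = np.linspace(0.0, 1.0, 5)
y = np.exp(0.7 * t)
q = lambda b: np.exp(b[0] * t) - y
jac = lambda b: (t * np.exp(b[0] * t)).reshape(-1, 1)
a_hat = gauss_newton(q, jac, [0.0])
```

For zero-residual problems such as this one, Gauss-Newton converges quadratically near the solution, which mirrors the fast convergence reported for NLS0-En4DVar in section 3a.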
We name the iterative algorithm in Eq. (14) NLS0-En4DVar since it is the solution to the nonlinear least squares problem in Eq. (10). Compared with the original En4DVar iterative algorithm in Eqs. (8) and (9), the formulation of NLS0-En4DVar is derived from Eq. (8) through Eqs. (10)-(14) without making any additional assumptions. The subscript 0 is thus used to indicate that the formulation Eq. (14) is derived from Eqs. (8) and (9) without any simplification; Eq. (14) is mathematically equivalent to Eqs. (8) and (9), namely, NLS0-En4DVar is equivalent to the original En4DVar. However, the adjoint models are nicely avoided: Eq. (14) involves only the products (H'_k M'^{,i−1}_k P_x,r), which can be formed explicitly by integrating the tangent linear models. Furthermore, Eq. (14) is easier to implement and more portable because it does not rely on any existing optimization package (Tian and Feng 2017). One may argue that the computational cost may be significantly increased because the ensemble size is expanded from N to N × r after applying the localization [i.e., Eq. (6)]. However, the situation may not be as bad as first imagined in view of the rapid development of high-performance computers. The tangent model M'^{,i−1}_k has two basic input variables: the linearization state x^{i−1} and the perturbation to be propagated. In the ith iteration step, x^{i−1} is assigned the value x^{i−1}_a, which is used to initialize the forecast model M_{t_0→t_k} to prepare the parameters (trajectory) for the tangent model M'^{,i−1}_k. After that, all the parameters of the tangent model are fixed during the ith iterative step (Zou et al. 1997). That means the computation of M'^{,i−1}_k P_x,r only needs to call the forecast model M_{t_0→t_k} once, followed by N × r series of linear operations in M'^{,i−1}_k. In addition, these N × r series of linear operations can be carried out in parallel. The situation is further improved in the following simplified version of NLS0-En4DVar.
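The tangent-linear products discussed above can also be approximated without an explicit TL model by differencing the nonlinear model, which is essentially the device the later, TLM-free variants exploit. A minimal sketch, with a hypothetical two-variable "model" M used purely for illustration:

```python
import numpy as np

def tlm_action_fd(M, x, p, eps=1e-6):
    """Approximate the tangent-linear action M'(x) p with a finite difference
    of the nonlinear model M; one such product is needed per column of P_x."""
    return (M(x + eps * p) - M(x)) / eps

# Hypothetical nonlinear "model", for illustration only.
M = lambda x: np.array([x[0] * x[1], x[1] ** 2])
x = np.array([2.0, 3.0])
p = np.array([1.0, 0.0])
Jp = tlm_action_fd(M, x, p)   # the exact Jacobian-vector product here is (3, 0)
```

Like the TL integrations of NLS0-En4DVar, the state x is fixed while the perturbation p varies, so the N (or N × r) products can be evaluated independently and in parallel.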

b. NLS1-En4DVar: Simplification by applying the localization in observation space
In view of the possibly high computational cost of evaluating (H'_k M'^{,i−1}_k P_x,r) in NLS0-En4DVar, the following approximation is adopted:

H'_k M'^{,i−1}_k P_x,r ≈ (H'_k M'^{,i−1}_k P_x) ⟨∘⟩ ρ_y = P^{i−1}_{y,k} ⟨∘⟩ ρ_y,

which then leads to NLS1-En4DVar, defined as

[Σ_{k=0}^{S} (P^{i−1}_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} (P^{i−1}_{y,k} ⟨∘⟩ ρ_y) + (N − 1)I] Δβ_i = Σ_{k=0}^{S} (P^{i−1}_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] − (N − 1)β_{i−1},   (15)

where

P^{i−1}_{y,k} = H'_k M'^{,i−1}_k P_x,   (16)

y'^{,i−1}_k = L_k(x_b + P_x,r β_{i−1}) − L_k(x_b), and ρ_y ∈ R^{m_y×r} should be computed together with ρ_x (ZT). This approximation alleviates the computational cost because it requires only N series of linear operations in M'^{,i−1}_k P_x, considerably fewer than the original N × r series of linear operations in M'^{,i−1}_k P_x,r in NLS0-En4DVar. It should be noted that solving Eq. (15) by the CG method is still extremely challenging since the size (N × r) of β_i is extremely large (∼10^{6−9}) as a result of the localization implementation [Eqs. (6) and (16)]. To avoid the expensive computational cost of the CG iterations, we first go back to the unlocalized version of Eq. (15),

[Σ_{k=0}^{S} (P^{i−1}_{y,k})^T R_k^{−1} P^{i−1}_{y,k} + (N − 1)I] Δβ_i = Σ_{k=0}^{S} (P^{i−1}_{y,k})^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] − (N − 1)β_{i−1},

and then transform it into a format in which Δβ_i is expressed explicitly:

Δβ_i = Σ_{k=0}^{S} (P^{#,i−1}_{y,k})^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] + Σ_{k=0}^{S} (P^{*,i−1}_{y,k})^T y'^{,i−1}_k,

where we mark

(P^{*,i−1}_{y,k})^T = −(N − 1) [Σ_{k=0}^{S} (P^{i−1}_{y,k})^T R_k^{−1} P^{i−1}_{y,k} + (N − 1)I]^{−1} [Σ_{k=0}^{S} (P^{i−1}_{y,k})^T P^{i−1}_{y,k}]^{−1} (P^{i−1}_{y,k})^T

and

(P^{#,i−1}_{y,k})^T = [Σ_{k=0}^{S} (P^{i−1}_{y,k})^T R_k^{−1} P^{i−1}_{y,k} + (N − 1)I]^{−1} (P^{i−1}_{y,k})^T.

Subsequently, we adapt the alternative localization scheme proposed by Tian and Feng (2015), localizing the matrices P^{#,i−1}_{y,k}, P^{*,i−1}_{y,k}, and P_x through (P^{#,i−1}_{y,k} ⟨∘⟩ ρ_y), (P^{*,i−1}_{y,k} ⟨∘⟩ ρ_y), and P_x,r = P_x ⟨∘⟩ ρ_x, to obtain

Δβ_i = Σ_{k=0}^{S} (P^{#,i−1}_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] + Σ_{k=0}^{S} (P^{*,i−1}_{y,k} ⟨∘⟩ ρ_y)^T y'^{,i−1}_k.   (17)

Apparently, Δβ_i is now computed explicitly by Eq. (17), and the CG iterations are thus avoided. A direct calculation [Eq. (18)] shows that the localization scheme adopted in Eq. (17) is completely equivalent to (but computationally more efficient than) that proposed by Tian and Feng (2015); in fact, the last line of Eq. (18) has exactly the same localization form as Eq. (30b) in Tian and Feng (2015). Here, ρ_xy (= ρ_x (ρ_y)^T) is the correlation matrix between the model states and the observational variables (ZT).
c. NLS2-En4DVar: Further simplification by abandoning the use of the tangent models

In NLS1-En4DVar, the tangent models M'^{,i−1}_k and H'_k are still retained, and their implementation could still be somewhat difficult, although much easier than that of their adjoint models. To further reduce the implementation complexity by abandoning the use of the tangent models, we introduce NLS2-En4DVar, which is formulated as follows:

Δβ_i = Σ_{k=0}^{S} (P^{#,a}_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] + Σ_{k=0}^{S} (P^{*,a}_{y,k} ⟨∘⟩ ρ_y)^T y'^{,i−1}_k,   (19a)

where

(P^{*,a}_{y,k})^T = −(N − 1) [Σ_{k=0}^{S} (P^a_{y,k})^T R_k^{−1} (P^a_{y,k}) + (N − 1)I]^{−1} [Σ_{k=0}^{S} (P^a_{y,k})^T (P^a_{y,k})]^{−1} (P^a_{y,k})^T,   (19b)

(P^{#,a}_{y,k})^T = [Σ_{k=0}^{S} (P^a_{y,k})^T R_k^{−1} (P^a_{y,k}) + (N − 1)I]^{−1} (P^a_{y,k})^T,   (19c)

P^a_{y,k} = (y'^a_{k,1}, . . . , y'^a_{k,N}), and y'^{,i−1}_k = L_k(x_b + x'^{,i−1}_a) − L_k(x_b). In NLS2-En4DVar, the forecast model M_{t_0→t_k} is repeatedly integrated N times at the ith iteration to form P^a_{y,k} by means of P^a_{y,k} = L_k(x^{i−1}_a + P_x) − L_k(x^{i−1}_a), which certainly leads to expensive computational costs.
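The member-by-member construction of the OPs can be sketched as follows; the operator L here is a hypothetical nonlinear forecast-plus-observation map used purely for illustration:

```python
import numpy as np

def observation_perturbations(L, xa, Px):
    """Form the simulated OPs P_y = L(xa + x'_j) - L(xa), one nonlinear run
    per ensemble member, as in NLS2-En4DVar (no tangent linear or adjoint
    models required)."""
    base = L(xa)
    return np.column_stack([L(xa + Px[:, j]) - base for j in range(Px.shape[1])])

# Hypothetical nonlinear operator L, for illustration only.
L = lambda x: np.array([x[0] ** 2, np.sin(x[1])])
xa = np.array([1.0, 0.5])
Px = 0.1 * np.eye(2)
Py = observation_perturbations(L, xa, Px)
```

Each column costs one full nonlinear integration, which is the N-runs-per-iteration expense that NLS3-En4DVar (next subsection) removes by freezing the OPs.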

d. NLS3-En4DVar: Further simplification by fixing the OPs
To reduce the computational costs, we fix the simulated OPs, P^a_{y,k} = P_{y,k} with P_{y,k} = L_k(x_b + P_x) − L_k(x_b), during the iteration process, and thus form the following NLS3-En4DVar method:

Δβ_i = Σ_{k=0}^{S} (P^#_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} [y'_obs,k − y'^{,i−1}_k] + Σ_{k=0}^{S} (P^*_{y,k} ⟨∘⟩ ρ_y)^T y'^{,i−1}_k,   (20)

where

(P^*_{y,k})^T = −(N − 1) [Σ_{k=0}^{S} (P_{y,k})^T R_k^{−1} P_{y,k} + (N − 1)I]^{−1} [Σ_{k=0}^{S} (P_{y,k})^T P_{y,k}]^{−1} (P_{y,k})^T

and

(P^#_{y,k})^T = [Σ_{k=0}^{S} (P_{y,k})^T R_k^{−1} P_{y,k} + (N − 1)I]^{−1} (P_{y,k})^T.

It is interesting to note that NLS3-En4DVar is just the NLS-4DVar method (but with a modified localization scheme) proposed by Tian and Feng (2015).
e. NLS4-En4DVar: One iteration of NLS3-En4DVar

Performing only one iteration of NLS3-En4DVar [Eq. (20)] with β_0 = 0, we obtain the following NLS4-En4DVar method:

β_1 = Σ_{k=0}^{S} (P^#_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} y'_obs,k,   (21)

which is just the POD-4DVar method proposed by Tian et al. (2011). Moreover, restricting Eq. (21) to the 3D case, we obtain

β_1 = (P^#_y ⟨∘⟩ ρ_y)^T R^{−1} y'_obs   (22a)

and thus

x'_a = P_x,r (P^#_y ⟨∘⟩ ρ_y)^T R^{−1} y'_obs,   (22b)

where

(P^#_y)^T = [(P_y)^T R^{−1} (P_y) + (N − 1)I]^{−1} (P_y)^T   (22c)

and P_y = H(x_b + P_x) − H(x_b), which is essentially the same as the LETKF method; the two differ only in their localization schemes. Here, H and R are the observation operator and the observation error covariance matrix, respectively, for the 3D case.
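A minimal sketch of the unlocalized 3D analysis of Eq. (22) follows; the matrices are random stand-ins for illustration, and localization is deliberately omitted:

```python
import numpy as np

def nls4_3d_analysis(Px, Py, R, d):
    """Sketch of Eq. (22) without localization:
    beta = [(P_y)^T R^{-1} P_y + (N - 1) I]^{-1} (P_y)^T R^{-1} d,
    x'_a = P_x beta, with innovation d = y_obs - H(x_b)."""
    N = Px.shape[1]
    Rinv = np.linalg.inv(R)
    A = Py.T @ Rinv @ Py + (N - 1) * np.eye(N)
    beta = np.linalg.solve(A, Py.T @ Rinv @ d)
    return Px @ beta

rng = np.random.default_rng(3)
m, N, p = 5, 4, 3
Px = rng.standard_normal((m, N))
Py = rng.standard_normal((p, N))
R = 0.1 * np.eye(p)
d = rng.standard_normal(p)
dx = nls4_3d_analysis(Px, Py, R, d)
```

By the Sherman-Morrison-Woodbury identity, this ensemble-space solve equals the familiar Kalman-gain form P_x (P_y)^T [P_y (P_y)^T + (N − 1)R]^{−1} d, which makes the kinship with LETKF transparent.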
f. NLS5-En4DVar: An iterative improvement scheme

Following Courtier et al. (1994), a linear approximation to Eq. (1a), valid in the vicinity of x_b, is proposed as follows:

J(δx_0) = 1/2 (δx_0)^T B^{−1} δx_0 + 1/2 Σ_{k=0}^{S} (H'_k M'_k δx_0 − y'_obs,k)^T R_k^{−1} (H'_k M'_k δx_0 − y'_obs,k).   (23)

Equation (23) is widely taken as the inner loop, in which reduced-resolution tangent linear and adjoint models with some simplified physical processes are the common choice of many operational centers. To account for some of the nonlinearities, an outer loop is added and the concept of the guess field x^i_a is introduced as follows:

J(δx^{i−1}_0) = 1/2 (δx^{i−1}_0)^T B^{−1} δx^{i−1}_0 + 1/2 Σ_{k=0}^{S} (H'_k M'_k δx^{i−1}_0 − y'^{,i−1}_obs,k)^T R_k^{−1} (H'_k M'_k δx^{i−1}_0 − y'^{,i−1}_obs,k),  y'^{,i−1}_obs,k = y_obs,k − L_k(x^{i−1}_a),  x^i_a = x^{i−1}_a + δx^{i−1}_0,   (24)

for i = 1, . . . , I_max, where x^0_a = x_b. Thus, we propose an iterative improvement method, called NLS5-En4DVar, of the NLSi-En4DVar (i = 0, 1, 2, 3) family. The NLS5-En4DVar approach also assumes that δx^{i−1}_0 is expressed by linear combinations of the IPs P^{i−1}_x = (x'^{,i−1}_1, . . . , x'^{,i−1}_N) as

δx^{i−1}_0 = P^{i−1}_x δβ^{i−1}.   (25)

Substituting Eq. (25), x'^{,i−1}_a = P^{i−1}_x β^{i−1}, and the ensemble background covariance B = [(P^{i−1}_x)(P^{i−1}_x)^T]/(N − 1) into Eq. (24) and expressing the cost function in terms of the new control variable δβ^{i−1} yields the quadratic cost function in Eq. (26). Following the idea of the NLS4-En4DVar approach, we can obtain

δβ^{i−1} = Σ_{k=0}^{S} (P^{#,i−1}_{y,k})^T R_k^{−1} y'^{,i−1}_obs,k,   (27)

where P^{i−1}_{y,k} = (y'^{,i−1}_{k,1}, . . . , y'^{,i−1}_{k,N}) = Y_k P^{i−1}_x with Y_k ≡ H'_k M'_k. After the localization process, we can further obtain

δβ^{i−1}_r = Σ_{k=0}^{S} (P^{#,i−1}_{y,k} ⟨∘⟩ ρ_y)^T R_k^{−1} y'^{,i−1}_obs,k   (28)

and

δx^{i−1}_0 = (P^{i−1}_x ⟨∘⟩ ρ_x) δβ^{i−1}_r.   (29)

Obviously, we can fix P^{i−1}_x = P_x = (x'_1, x'_2, . . . , x'_N) during the outer loop. In this case, we have to update the corresponding OPs P^{i−1}_{y,k} = (y'^{,i−1}_{k,1}, . . . , y'^{,i−1}_{k,N}) through y'^{,i−1}_{k,j} = L_k(x^{i−1}_a + x'_j) − L_k(x^{i−1}_a), which requires running the forecast model N times for each outer loop and thus leads to expensive computational costs. To address this, we can instead let P^{i−1}_x = (x'^{,i−1}_1, . . . , x'^{,i−1}_N) with

x'^{,i−1}_j = x_b + x'_j − x^{i−1}_a   (30a)

and then

y'^{,i−1}_{k,j} = L_k(x_b + x'_j) − L_k(x^{i−1}_a),   (30b)

which means that one only needs to produce the ensemble model integrations [i.e., L_k(x_b + x'_j), j = 1, . . . , N] once; they can then be reused throughout the outer loop of NLS5-En4DVar. The advantage of this strategy is that we can approximate H'_k M'_k P^{i−1}_x at x^{i−1}_a without running the forecast model N times repeatedly.
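The recentring trick of Eqs. (30a)-(30b) can be sketched as follows; the map L is a hypothetical stand-in for the forecast-plus-observation operator, and `recentre_ensemble` is our own illustrative helper, not the paper's code:

```python
import numpy as np

def recentre_ensemble(ens_runs, L, xb, Px, xa):
    """NLS5-style outer-loop update of Eqs. (30a)-(30b): reuse the stored
    ensemble runs L(xb + x'_j) and recentre them on the current guess xa,
    so each outer iteration needs only ONE new model run, L(xa)."""
    base = L(xa)                               # the only new run this iteration
    Px_i = (xb[:, None] + Px) - xa[:, None]    # Eq. (30a): x'_j <- xb + x'_j - xa
    Py_i = ens_runs - base[:, None]            # Eq. (30b): y'_j <- L(xb + x'_j) - L(xa)
    return Px_i, Py_i

# Hypothetical nonlinear forecast-plus-observation map, for illustration only.
L = lambda x: x ** 3
xb = np.array([0.5, -0.2])
Px = 0.1 * np.eye(2)
ens_runs = np.column_stack([L(xb + Px[:, j]) for j in range(2)])  # produced once

# First outer iteration starts from xa = xb:
Px1, Py1 = recentre_ensemble(ens_runs, L, xb, Px, xb)
```

When the guess equals the background (the first iteration), the recentred IPs and OPs coincide with the original ones; subsequent iterations shift both around the updated guess at the cost of a single new run of L.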
We note that the above NLS5-En4DVar method takes the NLS4-En4DVar analysis solution as the starting value x^1_a and improves it through an iterative procedure. Each iteration of NLS5-En4DVar consists of three steps. First, the guess field x^{i−1}_a is updated; this in turn calls for updates of P^{i−1}_x and P^{i−1}_{y,k} in the second step. Finally, the increment δβ^{i−1}_r is computed by calling a quasi-NLS4-En4DVar scheme. Repeating this process results in NLS5-En4DVar. Compared to the iterative methods NLSi-En4DVar (i = 0, . . . , 3), NLS5-En4DVar has one important advantage: it involves neither the tangent models nor the (expensive) adjoint models, while it can still update P^{i−1}_{y,k} with x^{i−1}_a during the iteration process. As a result, it should be cheaper to implement while achieving almost the same accuracy as NLS1-En4DVar, as demonstrated by the numerical tests performed in section 3a.
To summarize, NLS0-En4DVar provides an alternative approach (i.e., a Gauss-Newton iterative method) to the En4DVar formulation Eq. (8) without dealing with the adjoint models. This new approach is based on the idea of first transforming the original En4DVar formulation into a nonlinear least squares problem, which is then solved by a Gauss-Newton iterative algorithm. To reduce the computational complexity of the algorithm, we propose a series of simplified methods, namely, NLSi-En4DVar (i = 0, . . . , 4), with decreasing computational complexity at the expense of reduced accuracy. It is interesting to note that these NLSi-En4DVar (i = 0, . . . , 4) methods unify the representation of En4DVar, 4DEnVar, and LETKF into a single formulation and clearly reveal the relationships between these methods. They also provide a better understanding of these methods from a different perspective and more flexible choices for data assimilation in different situations. Furthermore, we also propose an improved iterative method (i.e., NLS5-En4DVar) based on a comprehensive analysis of the NLSi-En4DVar family of methods. In particular, an efficient localization scheme is introduced into the NLSi-En4DVar (i = 1, . . . , 5) methods to reduce the expensive computational costs of the CG iterations required for solving Eq. (15) in the original NLS1-En4DVar formulation.

Preliminary numerical evaluations
We design two groups of numerical experiments to evaluate the proposed NLSi-En4DVar methods. The first group of experiments is based on the Lorenz-96 model and is designed to better understand the inherent relationships among the methods. The second group of experiments is based on the Advanced Research WRF (ARW) Model and is used to further explore the potential applicability of three members of the NLSi-En4DVar family of methods to real atmosphere and/or ocean models.

a. Preliminary evaluations using the Lorenz-96 model
In this section, we use the Lorenz-96 model (Lorenz 1996) as the model platform to evaluate all the proposed methods (i.e., NLSi-En4DVar, i = 0, 1, 2, 3, 4, 5, and En4DVar). This model has been widely used to study various issues associated with data assimilation. For example, it has been used for the comparative study of 4DVar, 4DEnKF (Fertig et al. 2007), and EnKF (Kalnay et al. 2007), as well as for evaluations of EnKF without perturbed observations (Whitaker and Hamill 2002), 4DEnKF (Hunt et al. 2004), and 4DEnVar (Tian et al. 2011). Therefore, we choose the Lorenz-96 model to investigate the advantages and disadvantages of the seven data assimilation methods discussed in the previous section.
The Lorenz-96 model is governed by the following system of nonlinear equations:

dX_j/dt = (X_{j+1} − X_{j−2}) X_{j−1} − X_j + F,  j = 1, . . . , n,

with periodic boundary conditions. The model behaves quite differently with different values of F and produces chaotic systems for integer values of F larger than 3. In this configuration, we take n = 40 and F = 8. For computational stability, a time step of 0.05 units (or 6 h in equivalent; Lorenz 1996) is adopted, and a fourth-order Runge-Kutta scheme is used for the temporal integration in this study. In all cases, the true states and observations were generated by the model with F = 8.
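The model and time stepping described above can be sketched as follows (`lorenz96_tendency` and `rk4_step` are our own helper names; the spin-up length is arbitrary):

```python
import numpy as np

def lorenz96_tendency(x, F=8.0):
    """dX_j/dt = (X_{j+1} - X_{j-2}) X_{j-1} - X_j + F, periodic in j."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.05, F=8.0):
    """One fourth-order Runge-Kutta step (dt = 0.05 units, about 6 h)."""
    k1 = lorenz96_tendency(x, F)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, F)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, F)
    k4 = lorenz96_tendency(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Spin up from a slightly perturbed rest state, n = 40, F = 8.
x = np.full(40, 8.0)
x[0] += 0.01
for _ in range(1000):
    x = rk4_step(x)
```

With F = 8 the uniform state X_j = F is unstable, so the small perturbation grows and the trajectory settles onto the chaotic attractor used for the truth runs.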
These observations were then assimilated into models with F = 8 and 9. The default number of observations is m = 20 (equally spaced at every observation time). Observations were taken every two time steps (or 12 h) and were generated by adding random perturbations with a standard deviation of 0.1 to the time series of the true state. All experiments were carried out over 5 days. The default parameter setup is ensemble size N = 30, r = 10, and covariance localization Schur radius r_0 = 8 (only one direction in the Lorenz-96 model). The length of the assimilation window is four steps, and the ensemble is regenerated in the LETKF manner (Hunt et al. 2004) for each cycle. The background initial field is generated by integrating the model 100 000 steps from an arbitrary nonzero field, and the true initial field is subsequently obtained by integrating 20 steps from the background initial field. Therefore, this background state is significantly different from the ''true'' state (not shown).
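The synthetic-observation setup (m = 20 equally spaced sites, observation error standard deviation 0.1) can be sketched as follows; the random vector merely stands in for a true Lorenz-96 state:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 20
x_true = rng.standard_normal(n) + 8.0     # stand-in for a true model state
obs_idx = np.arange(0, n, n // m)         # m = 20 equally spaced observation sites
y_obs = x_true[obs_idx] + 0.1 * rng.standard_normal(m)   # obs error std 0.1
```

In the assimilation experiments, y_obs at each observation time defines the innovation y'_obs,k of Eq. (2), and R_k is the corresponding diagonal matrix with variance 0.01.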
The initial sample is generated by running the Lorenz model for 90 steps (i.e., sampling once every 3 time steps). The performance of the NLS0-En4DVar approach is first examined in comparison with the original En4DVar method. In this study, we employ the L-BFGS method (Liu and Nocedal 1989) for the minimization in the En4DVar approach through Eqs. (8) and (9). Correspondingly, the Gauss-Newton algorithm with a conjugate gradient linear solver is used to solve NLS0-En4DVar. We compare the performance of NLS0-En4DVar with the original En4DVar method under the perfect-model (Fig. 1a: F = 8 for the truth, forecast, and assimilation runs) and the imperfect-model (Fig. 1b: F = 8 for the truth run and F = 9 for the forecast and assimilation runs) assumptions. The results show that both methods perform equally satisfactorily in terms of overall low root-mean-square (RMS) errors. As noted above, NLS0-En4DVar is just a reformulation of the En4DVar method, so the two are mathematically equivalent. The difference between the two assimilation methods lies in the minimization algorithms used (L-BFGS versus Gauss-Newton). Numerically, the two methods may differ in assimilation performance because the adjoint models must be involved in En4DVar but not in NLS0-En4DVar.
To further examine the differences between the two iterative algorithms (L-BFGS and Gauss-Newton), we focus on their performance in the first assimilation window; their convergence speeds in particular are examined since their RMS errors are almost identical. Figure 2 shows that the cost function values decrease with the iteration steps (only the first 20 steps are shown) for the two methods under both the perfect- and imperfect-model assumptions. In the perfect-model case, the Gauss-Newton algorithm reaches its minimum (517.156) after 5 iterations, while the L-BFGS used in the original En4DVar method needs 82 iterations to reach its minimum value (also 517.156). Similarly, for the imperfect model, the minimum values (both 556.94) are reached in 8 iterations by the Gauss-Newton algorithm for NLS0-En4DVar and in 83 iterations by the L-BFGS for the En4DVar method. To further check the influence of the number of iterations on the assimilation results, we also evaluate the performance of En4DVar and NLS0-En4DVar with I_max = 4 in Fig. 1. Obviously, the performance of En4DVar depends on the maximum iteration number and is less satisfactory for smaller I_max. As expected, the extra cost of integrating the adjoint models (due to the iterative algorithm based on the L-BFGS) seriously limits the application of En4DVar to complicated real-world models. It should also be noted that NLS0-En4DVar requires series of linear operations in M'^{,i−1}_k, which certainly adds some computational cost, although its number of iterations is significantly smaller than that of the original En4DVar method.
Second, to investigate how the proposed simplifications affect the final analysis results, we designed another group of experiments to test the NLSi-En4DVar (i = 0, 1, 2, 3, 5) methods. As mentioned above, their assimilation accuracy is reduced gradually from NLS0-En4DVar to NLS3-En4DVar, with correspondingly decreasing computational complexity, in both the perfect- and imperfect-model scenarios (Fig. 3). By comparison, NLSi-En4DVar (i = 1, 2, 3) performs only slightly to moderately worse than NLS0-En4DVar does. This group of experiments illustrates that the tangent linear models (with the updated x^{i−1}_a during the iteration process) are vital to achieving higher assimilation accuracy. Remarkably, NLS5-En4DVar performs almost as well as NLS1-En4DVar and substantially better than NLS2-En4DVar in the perfect-model scenario. Furthermore, it is strongly competitive in the imperfect-model scenario, where its performance is on a par with that of NLS0-En4DVar [and better than that of the NLSi-En4DVar (i = 1, 2, 3) methods]. These results demonstrate that using (or approximating) the tangent models with the updated x^i_a during the iteration process contributes noticeably to the accuracy of the final assimilation result.
Finally, to examine the sensitivity of the 4DEnVar assimilation performance to the iteration strategy, we also compared the performance of NLS3-En4DVar (a typical 4DEnVar method) using different maximum iteration numbers (I_max = 1, 3, 5). We note that NLS3-En4DVar with I_max = 1 (i.e., no iteration) is just the NLS4-En4DVar method. As expected, the performance of NLS3-En4DVar improves steadily as the iteration number increases (Fig. 4). Noticeably, NLS4-En4DVar performs worse than the other methods. These results imply that the nonlinearity has a large impact on data assimilation and should not be ignored. The so-called no-iteration approach may not be a true advantage of an assimilation method because the small savings in computational cost are obtained at the expense of much reduced accuracy.

b. Preliminary evaluations using the Advanced Research WRF (ARW) Model
Version 3.7 of the ARW Model (ARW3.7) is used for the experiments of this section. All experiments use a mesh of 100 × 120 (latitude × longitude) grid points with a grid spacing of 30 km. The domain covers the region of China from 15.5° to 43.5°N and from 88.5° to 131.5°E, with 30 layers from η = 0 to η = 1 in the vertical direction. The experimental design used here is almost the same as that in Tian and Feng (2015). The NCEP Final (FNL) Operational Model Global Tropospheric Analyses (http://rda.ucar.edu/datasets/ds083.2/) are used as the first-guess field and boundary conditions in the experiments. The ''true'' and ''background'' (referred to as Ctrl) simulated fields and the synthetic observations are produced in the same way: the true and background fields are produced from 12- and 24-h forecasts initialized with the NCEP/FNL data at 12 and 24 h prior to the analysis time, respectively. We choose a 6-h assimilation window with accumulated rainfall and conventional temperature (only on the η = 0.563 layer) observation data available at 0, 3, and 6 h (at the observational stations; ZT) after the analysis time; the ensemble samples are generated by the 4D moving sampling strategy (Tian and Feng 2015), and the ensemble size is 146. The horizontal localization radius used in this section is 750 km and r = 42 (ZT). The three NLSi-En4DVar (i = 3, 4, 5) methods are tested to examine their potential applicability to complicated real-world nonlinear weather and climate models; the other three NLSi-En4DVar (i = 0, 1, 2) methods are too difficult or costly to implement with the WRF Model. We set I_max = 3 in the experiments testing the NLS3-En4DVar and NLS5-En4DVar methods.
To evaluate the overall performance of the NLSi-En4DVar (i = 3, 4, 5) methods, we first compare the root-mean-square errors (RMSEs) of 30-h forecasts of accumulated rainfall (1-30 h after the analysis time) with the ICs from the NLSi (i = 3, 4, 5) and Ctrl simulations, respectively. Figure 5 shows the RMSEs of 30-h forecasts of accumulated rainfall with the ICs from the assimilations; all three NLSi-En4DVar methods produce smaller values than Ctrl does, which indicates that the three NLSi-En4DVar methods can steadily improve the rainfall forecast over 30 h. Moreover, Fig. 5 also shows that both NLS3-En4DVar and NLS5-En4DVar moderately outperform NLS4-En4DVar (i.e., POD-4DVar), which illustrates the importance of the iterative strategy adopted in the two NLSi-En4DVar (i = 3, 5) methods. In particular, the improved iterative method NLS5-En4DVar performs slightly better than NLS3-En4DVar in terms of its lower RMSEs of 30-h forecasts of accumulated rainfall.
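The RMSE diagnostics used in these comparisons are straightforward; a minimal sketch of both a domain-wide score and a per-layer vertical profile follows, with field shapes chosen to match the 30-layer, 100 × 120 mesh described above (the synthetic data are purely illustrative).

```python
import numpy as np

def rmse(forecast, truth, axis=None):
    """Root-mean-square error between a forecast field and the truth."""
    return np.sqrt(np.mean((forecast - truth) ** 2, axis=axis))

# Synthetic 3-D fields shaped (levels, lat, lon); sizes match the mesh above
rng = np.random.default_rng(0)
truth = rng.standard_normal((30, 100, 120))
fcst = truth + 0.1 * rng.standard_normal(truth.shape)  # forecast = truth + noise

total = rmse(fcst, truth)                 # single score over the whole domain
profile = rmse(fcst, truth, axis=(1, 2))  # per-layer values, one per model level
```

Averaging over the horizontal axes only yields the vertical RMSE profiles of the kind compared across model layers in the next paragraph.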
Furthermore, to investigate the performance of the two NLSi-En4DVar (i = 3, 5) methods, we also compare the vertical profiles of the RMSEs of some basic model variables from NLSi (i = 3, 5) and Ctrl at the beginning and at the end of the assimilation window, calculated over all horizontal model grid points of each layer. As seen in Figs. 6 and 7, NLS5-En4DVar yields smaller errors than NLS3-En4DVar does on most model layers; thus, the same conclusion can be drawn that NLS5-En4DVar produces more accurate atmospheric states than NLS3-En4DVar does. This further indicates that the improved iterative strategy adopted in NLS5-En4DVar contributes substantially to its superior performance.
Additionally, to examine the impacts of the modified localization implementation in Eq. (17) on the NLSi-En4DVar methods, another version of NLS3-En4DVar using CG iterations [referred to as NLS3-En4DVar-CG; Zhang et al. (2016)] is included in the above comparisons. Generally, NLS3-En4DVar performs almost identically to NLS3-En4DVar-CG in terms of the RMSEs (not shown), which illustrates that the computational efficiency gained by avoiding the CG iterations in NLS3-En4DVar does not come at the expense of assimilation accuracy. All of the above numerical experiments are conducted serially on a Dell M620 server with two CPUs and 4 GB of memory.
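That the CG-based and CG-free versions agree is unsurprising: both routes solve the same symmetric positive definite linear system, one iteratively and one by direct elimination. A generic illustration of this equivalence (the matrix below is arbitrary, not the paper's Eq. (17) operators):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Minimal conjugate-gradient solver for an SPD system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
A0 = rng.standard_normal((50, 50))
A = A0 @ A0.T + 50.0 * np.eye(50)   # SPD, well conditioned
b = rng.standard_normal(50)

x_direct = np.linalg.solve(A, b)    # direct route (no CG iterations)
x_cg = conjugate_gradient(A, b)     # iterative CG route
```

For a well-conditioned system, the two solutions agree to within the CG tolerance, so the choice between them is purely a matter of cost, not accuracy.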
The CPU times required to complete the pure assimilation experiments (not including the forecast model run time) for the NLSi-En4DVar (i = 3, 4, 5) and NLS3-En4DVar-CG methods are compared in Table 2. The results show that the improved iterative strategy adopted in NLS5-En4DVar adds almost no (i.e., negligible) computational cost, which is expected because the NLS5-En4DVar reformulation does not require any additional forecast model runs. Furthermore, the comparison between the two versions of NLS3-En4DVar illustrates that introducing the modified localization scheme without the CG iterations indeed results in a significant speedup. In addition, the CPU time for running the forecast model (i.e., the WRF Model) once is about 10.3 min. We point out that CPU times are always case and machine dependent.

Summary and conclusions
The only difference between En4DVar and the traditional 4DVar is that En4DVar uses the ensemble background covariance matrix B_e, provided by the parallel EnKF implementation, to replace the usual static, flow-independent covariance matrix B in the traditional 4DVar scheme. The inclusion of the matrix B_e enables En4DVar to benefit from the flow-dependent strength of EnKF. However, the adjoint model is still required inside the minimization procedure of En4DVar. An alternative strategy is exploited in this study, with a focus on circumventing the obstacle of the adjoint models. To accomplish this goal, we first transform the standard En4DVar into a nonlinear least squares problem. Subsequently, a Gauss-Newton iterative algorithm is proposed to iteratively approximate the optimal nonlinear 4DVar solution without invoking the adjoint models, which defines NLS0-En4DVar. To reduce the computational and implementation costs, the Gauss-Newton iterative algorithm is further simplified with different levels of accuracy and computational complexity, which provides more flexible choices for data assimilation under different situations. The advantages and disadvantages of the proposed NLSi-En4DVar methods and their inherent relationships are demonstrated by several groups of numerical evaluation experiments with the Lorenz-96 and Advanced Research version of WRF (ARW) models, respectively. Our main conclusions are summarized as follows:

• The proposed NLSi-En4DVar family of methods provides a unified formulation of En4DVar, 4DEnVar, and LETKF for a better understanding of their relationships, and a new perspective for future research on data assimilation.
• NLS0-En4DVar provides an alternative to the En4DVar approach by explicitly using an iterative algorithm rather than resorting to an existing optimization package. Unlike the usual En4DVar method, which performs several evaluations of the cost function and its gradient within the L-BFGS algorithm, the proposed Gauss-Newton algorithm tactically avoids the adjoint models. Moreover, the added computational cost can be compensated for by its ease of parallel computing.
• Looking forward, NLS1-En4DVar appears to be a very promising and competitive option because of its moderate difficulty of computer implementation (it requires only the tangent linear model, as NLS0-En4DVar does) and its significantly reduced computational costs, all of which are achieved with only a small reduction in convergence speed compared with the original NLS0-En4DVar.

• Compared with NLS1-En4DVar, NLS5-En4DVar is quite advantageous and has great potential because of its ease of implementation and the absence of any need to invoke the tangent linear and adjoint models, as well as its superior performance.

FIG. 6. Vertical profiles of RMSEs of (a) zonal wind (m s⁻¹), (b) meridional wind (m s⁻¹), (c) temperature (°C), and (d) water vapor mixing ratio (g kg⁻¹) of the NLS3 and NLS5 methods at the start of the assimilation window.
• The iteration strategy adopted in the proposed NLSi-En4DVar family of methods is expected to significantly improve the quality of the data assimilation results because of its ability to handle the strong nonlinearity of the models, which is common in data assimilation.

FIG. 7. As in Fig. 6, but at the end of the assimilation window.
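The adjoint-free Gauss-Newton iteration at the heart of these methods can be sketched schematically. The sketch below is not the paper's exact formulation: the toy observation operator, dimensions, and function names are illustrative assumptions. The point it demonstrates is that the Jacobian is approximated by differences of nonlinear-operator evaluations through the ensemble perturbations, so no tangent-linear or adjoint code is needed.

```python
import numpy as np

def nls_en4dvar_sketch(xb, X, Hf, y, Rinv, i_max=3):
    """
    Schematic adjoint-free Gauss-Newton iteration in the ensemble subspace:
    the analysis is x = xb + X w, and the Jacobian of the nonlinear operator
    Hf is approximated by finite differences along the ensemble perturbations X.
    """
    n_ens = X.shape[1]
    w = np.zeros(n_ens)
    for _ in range(i_max):
        x = xb + X @ w
        # Ensemble (finite-difference) approximation of H'(x) @ X, column by column
        HX = np.column_stack([Hf(x + X[:, j]) - Hf(x) for j in range(n_ens)])
        d = y - Hf(x)                            # innovation at the current iterate
        A = np.eye(n_ens) + HX.T @ Rinv @ HX     # Gauss-Newton Hessian in w-space
        w = w + np.linalg.solve(A, HX.T @ Rinv @ d - w)
    return xb + X @ w

# Toy illustration (all names and sizes hypothetical)
rng = np.random.default_rng(0)
xb = np.zeros(5)                                    # background state
X = 0.5 * rng.standard_normal((5, 4))               # ensemble perturbations
x_true = xb + X @ np.array([1.0, -1.0, 0.5, 0.2])   # truth inside the ensemble span
Hf = lambda x: x + 0.1 * x ** 2                     # mildly nonlinear operator
y = Hf(x_true)                                      # perfect synthetic observations
Rinv = 100.0 * np.eye(5)                            # small observation error
xa = nls_en4dvar_sketch(xb, X, Hf, y, Rinv, i_max=3)
```

Each Gauss-Newton sweep re-evaluates the nonlinear operator about the current iterate, which is what allows the iterated variants to handle the nonlinearity that the single-step (no-iteration) variant ignores.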