In variational data assimilation (VDA) for meteorological and/or oceanic models, the assimilated fields are deduced by combining the model and the gradient of a cost functional measuring discrepancy between model solution and observation, via a first-order optimality system. However, existence and uniqueness of the VDA problem along with convergence of the algorithms for its implementation depend on the convexity of the cost function. Properties of local convexity can be deduced by studying the Hessian of the cost function in the vicinity of the optimum. This shows the necessity of second-order information to ensure a unique solution to the VDA problem.
In this paper a comprehensive review of issues related to second-order analysis of the problem of VDA is presented along with many important issues closely connected to it. In particular issues of existence, uniqueness, and regularization through second-order properties are examined. The focus then shifts to second-order information related to statistical properties and to issues related to preconditioning and optimization methods and second-order VDA analysis. Predictability and its relation to the structure of the Hessian of the cost functional is then discussed along with issues of sensitivity analysis in the presence of data being assimilated. Computational complexity issues are also addressed and discussed.
Automatic differentiation issues related to second-order information are also discussed along with the computational complexity of deriving the second-order adjoint.
Finally an application aimed at illustrating the use of automatic differentiation for deriving the second-order adjoint as well as the Hessian/vector product applied to minimizing a cost functional of a meteorological problem using the truncated-Newton method is presented. Results verifying numerically the computational cost of deriving the second-order adjoint as well as results related to the spectrum of the Hessian of the cost functional are displayed and discussed.
Data assimilation can be described as the ensemble of techniques for retrieving geophysical fields from different sources such as observations, governing equations, statistics, etc.
Being heterogeneous in nature, quality, and density, these data sources have to be put together to optimally retrieve (the meaning of “optimal” has to be precisely defined) the geophysical fields. Due to its inherent operational tasks, meteorology has played an important role in the development of data assimilation techniques. An ever-increasing amount of data and models are considered as an ensemble from which the optimal information should be extracted. Behind most of the classical methods used in meteorology such as optimal interpolation, variational methods, statistical estimation, etc., there is a variational principle, that is, the retrieved fields are obtained through the minimization of some functional depending on the various sources of information. The retrieved fields are obtained through some optimality condition, which can be an Euler or Euler–Lagrange condition if the regularity conditions are satisfied. Since these conditions are first-order conditions, it follows that they involve the first-order derivatives of the functional that is minimized. In this sense, data assimilation techniques are first-order methods. But first-order methods are only necessary conditions for optimality but not sufficient ones. To obtain sufficient conditions we need to proceed one step further and to introduce second-order information. By the same token, from the mathematical point of view sensitivity studies with respect to some parameter can be obtained through Gâteaux derivatives with respect to this parameter. Therefore if we seek the sensitivity of fields that have already been defined through some first-order conditions, we will have to proceed one order of derivation further and in this sense our sensitivity studies require second-order information.
The purpose of this review paper is to show how to obtain and how to use in an efficient way second-order information in data assimilation. In a first part we will show how the second-order derivative can be computed, primarily in a very general framework, then illustrate it with some examples. Then we will show how this second-order information can be linked to the issues of uniqueness of a solution to the problem of data assimilation. This will be shown to be not only a mathematical consideration but also rather a practical issue whereby information can be extracted by studying second-order information.
In a second part of the paper we will proceed to show how to derive sensitivity analyses from models and data. The analysis of the impact of uncertainties in the model and in the data provides essential links between purely deterministic methods (such as variational data assimilation) and stochastic methods (Kalman filter–type). We will then proceed to demonstrate how the link between these methods can be clearly understood through use of second-order information.
Researchers in other disciplines have carried out pioneering work using second-order information. Work in seismology using second-order information and applying it to obtain accurate Hessian/vector products for truncated-Newton minimization was carried out by Santosa and Symes (1988, 1989) and by Symes (1990, 1991, 1993). Reuther (1996) and Arian and Ta'asan (1999) illustrated the importance of second-order adjoint analysis for optimal control and shape optimization for inviscid aerodynamics. Hou and Sheen (1993) used second-order sensitivity analysis for heat conduction problems.
Second-order information was tackled in automatic differentiation (AD) by Abate et al. (1997), Giering and Kaminski (1998a,b), Gay (1996), Hovland (1995), Bischof (1995), Burger et al. (1992), Griewank and Corliss (1991), and Griewank (1993, 2000, 2001), to cite but a few. Several AD packages such as the tangent linear and adjoint model compiler (TAMC) of Giering and Kaminski (1998a) allow calculation of the Hessian of the cost functional.
Early work on second-order information in meteorology includes Thacker (1989) followed by work of Wang et al. (1992, 1993) and Wang (1993). Wang et al. (1995, 1998) considered use of second-order information for optimization purposes namely to obtain truncated-Newton and adjoint Newton algorithms using exact Hessian/vector products. Application of these ideas was presented in Wang et al. (1997).
Kalnay et al. (2000) introduced an elegant and novel pseudo-inverse approach and showed its connection to the adjoint Newton algorithm of Wang et al. (1997) (see Kalnay et al. 2000; Pu and Kalnay 1999; Pu et al. 1997).
Ngodock (1996) applied second-order information in conjunction with sensitivity analysis in the presence of observations and applied it to the ocean circulation. Le Dimet et al. (1997) presented the basic theory for second-order adjoint analysis related to sensitivity analysis. A condensed summary of the theory is presented in Le Dimet and Charpentier (1998).
The structure of the paper is as follows. Section 2 deals with the theory of the second-order adjoint method, both for time-independent and time-dependent models. The methodology is briefly illustrated using the shallow-water equations model. Section 3 deals with the connection between sensitivity analysis and second-order information. Section 4 briefly presents the Kalnay et al. (2000) quasi-inverse method and its connection with second-order information. Issues related to second-order Hessian information in optimization theory are addressed in section 5. Both unconstrained and constrained minimization issues are briefly discussed. Finally the use of accurate Hessian/vector products to improve the performance of the truncated Newton method are presented along with the adjoint truncated-Newton method. A method for approximating the Hessian of the cost function with respect to the control variables proposed by Courtier et al. (1994), based on rank-p approximation and bearing similarity to approximation of the Hessian in quasi-Newton methods (see Davidon 1959, 1991), is presented in section 5d.
Section 6 is dedicated to methods of obtaining the second-order adjoint via AD technology, while issues of computational complexity of AD for the second-order adjoint are presented in the appendix. Use of the Hessian of the cost functional to estimate error covariance matrices is briefly discussed in section 7. The use of Hessian singular vectors used for development of a simplified Kalman filter is addressed briefly in section 8.
Finally as a numerical illustration we present in section 9 the application of the second-order adjoint of a limited area model of the shallow-water equations to obtain an accurate Hessian/vector product compared to an approximate Hessian vector product obtained by finite differences. Automatic differentiation is implemented using the adjoint model compiler TAMC. The Hessian/vector information is used in a truncated-Newton minimization algorithm of the cost functional with respect to the initial conditions taken as the control variables and its impact versus the Hessian/vector product obtained via finite differences is assessed. The numerical results obtained verify the theoretically derived computational cost of obtaining the second-order adjoint via automatic differentiation. The Arnoldi package (ARPACK) was then used in conjunction with the second-order adjoint to gain information about the spectrum of the Hessian of the cost function. The unified notation of Ide et al. (1997) for data assimilation will be used.
Summary and conclusions are finally presented in section 10.
2. Computing the second-order information
In this section we will deal with deterministic models while the case of stochastic modeling will be discussed later in this manuscript.
In general we will assume that the model has the general form
where X, the state variable, describes the state of the environment and U is the input of the model, that is, an initial condition that has to be provided to the model to obtain from Eq. (1) a unique solution X(U). We will assume that X and U belong to a space equipped with an inner product.
The closure of the model is obtained through a variational principle that can be considered as the minimization of some functional:
For instance, in the case of variational data assimilation, J may be viewed as representing the cost function measuring the discrepancy between the observation and the solution associated with the value U of the input parameter. Therefore the optimal input for the model will minimize J.
a. First-order necessary conditions
If the optimal U minimizes J, then it satisfies the Euler equations given by
where ∇J is the gradient of J with respect to control variables.
The gradient of J is obtained in the following way:
We compute the Gâteaux (directional) derivative of the model and of F in some direction u. We may write
where (^) stands for the Gâteaux derivative. Let Z be an application from ℝn into ℝn with variable U. We define the Gâteaux derivative of Z in the direction u when this limit exists. For a generic function Z it is given by
If Ẑ(U, u) is linear in u, we can write
where ∇Z is the gradient of Z with respect to U, and 〈 · , · 〉 stands for the inner product. The Gâteaux derivative is also called a directional derivative.
Here, ∂ℱ/∂X (or ∂ℱ/∂U) is the Jacobian of ℱ with respect to X (or U) and
The gradient of J is obtained by exhibiting the linear dependence of Ĵ with respect to u. This is done by introducing the adjoint variable P (to be defined later according to convenience).
Taking the inner product between (4) and P yields
Therefore using (6), if P is defined as the solution of the adjoint model,
then we obtain
b. Second-order adjoint
To obtain second-order information, we seek the product of the Hessian matrix 𝗚(U) of J with some vector u. As before we apply a perturbation u to Eqs. (1) and (9), and from Eqs. (9) and (10) we obtain
We introduce here Q and R, two additional variables. To eliminate X̂ and P, we will take the inner product of Eqs. (4) and (11) with Q and R, respectively; then we add the results. We then obtain
Let us take the inner product of Eq. (12) with u; then we may write
From (13) we get
Therefore if Q and R are defined as being the solution of
then we obtain
For Eqs. (14)–(15) we took into account the symmetry of the second derivative, for example,
leading to some simplifications.
The system (16)–(17), which has to be solved to obtain the Hessian/vector product, can be derived from the Gâteaux derivative (4), which is the same as (17). In the literature, the system (16)–(17) is often called the tangent linear model, this denomination being rather inappropriate because it implies the issue of linearization and the subsequent notion of range of validity, which is not relevant in the case of a derivative.
In the case of an N-finite dimensional space the Hessian can be fully computed after N integrations of vector components ei of the canonical base.
Equation (16) differs from the adjoint model by the forcing terms, which will depend on u and R.
The system (16)–(18) will yield the exact value of the Hessian/vector product. An approximation could be obtained by standard finite differences, that is,
where α is the finite-difference interval that has to be judicially chosen. In the incremental three- and four-dimensional variational data assimilation (3D-/4DVAR) approach the Hessian can readily be obtained by differencing two gradients.
However several integrations of the model and of its adjoint model will be necessary in this case to determine the range of validity of the finite-difference approximation (Wang 1995 and references therein).
d. Time-dependent model
In the case of variational data assimilation the model ℱ is a differential system on the time interval [0, T]. The evolution of X between 0 and T is governed by the differential system:
Here X belongs to the Hilbert space ℍ, which is a subspace of the n-dimensional Cartesian product space [𝔸(0, T)]n.
The input variable is often the initial condition:
In this system ℱ is a nonlinear operator that describes the dynamics of the model; V ∈ 𝒱 ⊂ [𝔸(0, T)]m is a term used to represent the uncertainties of the model, which we assume to be linearly coupled through the (n, m)-dimensional matrix 𝗕; U is the initial condition; and the criteria J is the discrepancy between the solution of (20)–(21) and observations:
where H is the observation matrix, that is, a linear operator mapping X into 𝕆obs. The problem consists in determining U and V that minimize J.
A perturbation v on V and u on U gives X̂ and Ĵ the Gâteaux derivatives of X and J as a solution of
Let us introduce P, the adjoint variable; we take the product of (23) with P after a summation on the interval [0, T] and an integration by parts followed by identification of linearities with respect to u and v in (25); next we conclude that if P is defined as the solution of the adjoint model
then the components of the gradient ∇J with respect to U and V are
Because V is time dependent, its associated adjoint variable Q will be also time dependent. Let us remark that the gradient of J with respect to V will depend on time, which is not surprising since J also depends on time. From a computational point of view the discretization of V will have to be carried out in such a way that the discretized variable remains in a space of “reasonable” dimension.
The second derivative will be derived after a perturbation H on the control variables U and V:
The Gâteaux derivatives X̂, P of X, and P in the direction of H are obtained as the solution of the coupled system:
We then introduce Q and R, second-order adjoint variables. They will be defined later for ease in use of presentations.
The terms in P̂ and X̂ are collected and after integration by parts and some additional transformations we obtain
Let 𝗚 be the Hessian matrix of the cost J. We then have
Therefore if we define the second-order adjoint as being the solution of
then it remains
We would like to point out that Eq. (44) follows directly from Eq. (43) by using Eq. (42). The product of the Hessian by a vector is obtained exactly by a direct integration of (40) and (42) followed by a backward integration in time of (39) and (41).
One can obtain 𝗚 by n integrations of the differential system:
with the conditions
where ei are the n vectors of the canonical base of ℝn thus obtaining
One then integrates m times the differential system:
with terminal and, respectively, initial conditions
where fj are the m canonical base vectors of ℝm obtaining
The system defined by these equations is the second-order adjoint model. The Hessian matrix is obtained via n + m integrations of the second-order adjoint. The second-order adjoint is easily obtained from the first-order adjoint; differing from it only by some forcing terms, in particular the second-order term. The second equation is that of the linearized model (the tangent linear model).
One can also obtain the product of a vector of the control space, times the Hessian at the cost of a single integration of the second-order adjoint.
e. Example: The shallow-water equations
The shallow-water equations (SWEs) represent the flow of an incompressible fluid whose depth is small with respect to the horizontal dimension.
The SWEs can be written in a Cartesian system:
In this system of equations X = (u, υ, ϕ)T is the state variable, u and υ are the components of the horizontal velocity, ϕ is the geopotential, and f the Coriolis parameter. We aim to present this example in order to provide a didactic setup, thus we will make the strongest simplifications.
We neglect the model error, which following the previous notation, implies 𝗕 ≡ 0. We only control the initial conditions.
We impose periodic boundary conditions.
We derive directly the tangent linear model (TLM). The barred variables X = (u, υ, ϕ)T are the directional derivatives in the direction of the perturbation h = (hu, hυ, hϕ)T applied to the initial condition and we obtain
By transposing the TLM we obtain the adjoint model. Let P = (ũ, υ̃, ϕ̃)T be the adjoint variable, then the adjoint model satisfies
To obtain the second-order model we linearize the couple direct model and adjoint model; we then transpose and obtain the second-order adjoint variable Q = (û, υ̂, ϕ̂)T and the variable R = (u, υ, ϕ)T defined by the TLM:
We see that formally the first- and second-order adjoint models differ only by second-order terms, which contain the adjoint variables.
The calculation of second-order derivatives requires the storage of the model trajectory, the tangent linear model, and the adjoint model.
3. Sensitivity analysis and second-order information
a. General sensitivity analysis
In general a model has the following kinds of variables:
(i) State variable: 𝗭 in a space 𝕃, which describes the physical properties of the medium (velocity, pressure, temperature, …); 𝗭 depends on time and space.
(ii) Input variable: 𝗜 in a space 𝕃, which has to be provided to the model (e.g., initial or boundary conditions), most of the time these variables are not directly measured but they can be estimated through data assimilation.
(iii) Parameters: 𝗞 represents empirical parameters (e.g., diffusivity) that most models contain, which have to be tuned to adjust the model to the observations.
From the mathematical point of view a model is written as
where ℱ′ is some PDE operator (depending on time or being steady state) or its discrete counterpart. We assume that if 𝗜 and 𝗞 are given, then the model has a unique solution 𝗭(𝗜, 𝗞).
In many applications sensitivity analysis is carried out; for instance if we consider some scalar quantity linked to a solution of the model, what will be its variation if there is some perturbation on the inputs of the model?
From the formal point of view a sensitivity analysis is defined by a so-called response function 𝗚: 𝕃 → ℝ, depending on the state variable X (and therefore indirectly depending on 𝗜 and 𝗞). By definition the sensitivity of 𝗚 with respect to 𝗞 (respectively 𝗜) is the gradient of 𝗚 with respect to 𝗞 (respectively 𝗜):
There are two ways to estimate the sensitivity.
b. Sensitivity analysis via finite differences
Assume we are looking for the sensitivity with respect to 𝗞 in a finite space of dimension N. Let ki, 1 ≤ i ≤ N be the components of 𝗞, then
and estimation of ∂𝗚/∂ki can be carried out by computing
where ki are the canonical base vectors. This procedure for estimating a sensitivity is simple. Nevertheless it can be costly if N is large since it will require as many integrations of the model as N. Furthermore, (70) gives only one approximation of the gradient. The scalar α has to be chosen such that the model has a linear behavior with respect to α. The determination of α may require several integrations of the model for each value of ki. The main advantage of this method is that it does not require important software development.
c. Sensitivity via the adjoint
Because the sensitivity is a gradient, the adjoint variable may be used to derive it. Let i and k be perturbations on 𝗜 and 𝗞. The derivation of (69) leads to
and for the response function,
The gradient will be obtained by exhibiting the linear dependence of Ĝ with respect to i and k. Let us introduce Π, an adjoint variable of the same dimension as 𝗭.
Taking the inner product of (71) with Π gives
By identification to (72), if the adjoint model is defined as the solution of
then the sensitivities SI with respect to 𝗜 (respectively SK, with respect to 𝗞) are given by
It is worth noting that the sensitivity is obtained only after one run of the adjoint model and the result is exact. The cost to be paid is in software development since an adjoint model has to be developed.
d. Sensitivity analysis and data assimilation
Previously we have assumed that the input parameters of the model are known. In fact they are indirectly derived from observations through a process of data assimilation. If a variational data assimilation procedure is carried out, and if X is a state variable, 𝗜 the input, ℱ the model, and P the adjoint variable, then X and P are solutions of the following optimality system (OS), J(X, 𝗜) being the cost function:
For sensitivity studies in the presence of observations, with a given response function we have to consider the OS as a generalized model ℱ′ with a state variable 𝗭 = (X, P)T, and a general sensitivity analysis has to be applied to this general model. Therefore the adjoint of the optimally system has to be derived.
After a perturbation i on 𝗜, we may write
and for the response function 𝗚 defined earlier we get
Here, X̂ and P̂ are the Gâteaux derivatives of X and P in the direction i. After second-order adjoint variables Q and R are introduced, we take the inner product of (74) and (75) by Q and (76) by R; then we add the three equations and we may write
By identification in (77) it follows that if Q and R are defined as solution of
then we get the gradient of G with respect to 𝗜, that is the sensitivity
Therefore the algorithm to get the sensitivity is as follows:
The sensitivity in the presence of observations requires taking into account the second-order information. A very simple example given by Le Dimet et al. (1997) clearly shows the necessity of the introduction of this term.
4. Kalnay et al. (2000) quasi-inverse method and second-order information
The inverse 3DVAR method proposed by Kalnay et al. (2000) is introduced by considering a cost functional:
where δx is the difference between the analysis and the background at the beginning of the assimilation window, 𝗟 and 𝗟T are the TLM and its adjoint, and H is the tangent linear version of the forward observation operator ℋ. In addition, 𝗕 is the forecast error covariance and 𝗥 is the observational error covariance.
Taking the gradient of J with respect to the initial change δx = xa − xb, where xa and xb are the analysis and first guess, respectively, we obtain
In an adjoint 4DVAR an iterative minimization algorithm such as the quasi-Newton or conjugate gradient is employed to obtain the optimal perturbation:
here i is the minimization counter, where αi is the step size in the minimization algorithm.
One stops after a number of minimization iterations when ‖∇J‖ is small enough to satisfy a convergence criterion.
In order to determine the optimal value of the step size, the minimization algorithm, say quasi-Newton, requires additional computations of the gradient ∇Ji−1, so that the number of direct and adjoint integrations required by adjoint 4DVAR can be larger than the number of minimization iterations (see Kalnay et al. (2000).
The inverse 3DVAR approach of Kalnay seeks to obtain directly the “perfect solution,” that is, the special δx that makes ∇J = 0, provided δx is small.
Eliminating in (81) the adjoint operator one gets
Since we have the quasi-inverse model obtained by integrating TLM backward, that is, a good approximation of 𝗟−1, we obtain
Wang et al. (1998) proposed an adjoint Newton algorithm that also required the backward integration of the tangent linear model and proposed a reformulation of the adjoint Newton when the TLM is not invertible. They did not explore this idea in depth. Physical processes are generally not parameterized in a reversible form in atmospheric models—a problem that can be only overcome to some extent by using simplified reversible physics. Also truly dissipative processes in atmospheric models are not reversible and as such will constitute a problem for the inverse 3DVAR. To show the link of inverse 3DVAR to second-order information we follow Kalnay et al. (2000) to show that inverse 3DVAR is equivalent to using a perfect Newton iterative method to solve the minimization problem at a given time level.
If we look for the minimum of the cost functional at x + δx given that our present estimate of the solution is x, then expanding ∇J(x + δx) in a Taylor series to second term yields
where ∇2J(x) is the Hessian matrix.
The Newton iteration is
For the cost function (80) the Hessian is given by
A first iteration with the Newton minimization algorithm yields
which is identical with the inverse 3DVAR solution.
Since cost functions used in 4DVAR are close to quadratic functions, one may view 3DVAR as a perfect preconditioner of a simplified 4DVAR problem.
In general, availability of second-order information allows powerful minimization algorithms to be used (Wang et al. 1995, 1997) even when the inverse 3DVAR is difficult to obtain as is the case with full physics models.
5. Hessian information in optimization theory
Hessian information is crucial in many aspects of both constrained and unconstrained minimization. All minimization methods start by assuming a quadratic model in the vicinity of the minimum of a multivariate minimization problem.
For the problem
the necessary condition for X* to be a stationary point is
and in order to obtain sufficient conditions for the existence of the minimum of the multivariate unconstrained minimization problem, we must require that the Hessian at X* is positive definite.
a. Spectrum of the Hessian and rate of convergence of unconstrained minimization
The eigenvalues of the Hessian matrix predict the behavior and convergence rate for unconstrained minimization. To show this, let us consider again the multivariate nonlinear function ℱ(X) of (89) and let X* denote a local minimizer point that satisfies the condition
for all X such that
where ε is typically a small positive number whose value may depend on the value of X*. We define ℱ(X*) as an acceptable solution of (89).
If ℱ is twice continuously differentiable, and X* is an absolute minimum, then
and the Hessian 𝗚(X*) of ℱ at X* is positive definite, that is,
Let us expand ℱ in a Taylor series about X*:
For any acceptable solution we obtain
such that the condition number of the Hessian 𝗚(X*) substantially affects the size of ‖X − X*‖, that is, the rate of convergence of the unconstrained minimization (Gill et al. 1981).
If 𝗚(X*) is ill-conditioned, the error in X will vary with the direction of the perturbation p.
If p is a linear combination of eigenvectors of 𝗚(X*) corresponding to the largest eigenvalues, the size of ‖X − X*‖ will be relatively small, while if, on the other hand, p is a linear combination of eigenvectors of 𝗚(X*) corresponding to the smallest eigenvalues, the size of ‖X − X*‖ will be relatively large; that is, there will be slow convergence.
b. Role of the Hessian in constrained minimization
The Hessian information plays a very important role in constrained optimization as well. We shall deal here with optimality conditions where again Taylor series approximations are used to analyze the behavior of the objective function ℱ and constraints hi about a local constrained minimizer X*.
We shall consider first optimal conditions for linear equality constraints.
The problem is cast as
where 𝗔 is an m × n matrix, m ≤ n. We assume that ℱ is twice continuously differentiable and that rows of 𝗔 are independent, that is, 𝗔 has full row rank. The feasible region consists of the set of points satisfying all the constrains.
Any problem with linear constraints 𝗔X = b can be recast as an equivalent unconstrained problem. Assume we have a feasible point X; that is,
Then any other feasible point can be expressed as
where p is a feasible direction.
Any feasible direction must lie in the null space of 𝗔, the set of vectors p satisfying
Denoting the null space of 𝗔 by ℕ(𝗔), the feasible region is given by
Let 𝗭 be the null space matrix for 𝗔 of dimension n × r with r ≥ n − m. Then the feasible region is given by
Then the constrained minimization problem in X is equivalent to the unconstrained problem:
where X is a feasible point (Gill et al. 1981; Nash and Sofer 1996). The function Φ is the restriction of ℱ onto the feasible region, called the reduced function. If 𝗭 is a basis matrix for the null space of 𝗔, then Φ is a function of n − m variables. The constrained problem has been transformed into an unconstrained problem with a reduced number of variables.
Optimality conditions involve derivatives of the reduced function. If X = X + 𝗭v
is called the reduced gradient of ℱ at X. Similarly the matrix
is called the reduced or projected Hessian matrix.
The reduced gradient and Hessian matrix are the gradient and Hessian of the restriction of ℱ onto the feasible region evaluated at X. If X* is a local solution of the constrained problem, then
and v* is the local minimizer of Φ. Hence we can write
and ∇2Φ(v*) is positive semidefinite.
Here we briefly present in the framework of optimality conditions for linear equality constraints the necessary conditions for a local minimizer. If X* is a local minimizer of ℱ and 𝗭 is the null-space matrix for 𝗔, then
and 𝗭T∇2ℱ(X*)𝗭 is positive semidefinite. That is, the reduced gradient is zero and the reduced Hessian matrix is positive semidefinite. (The second-order derivative information is used to distinguish local minimizers from other stationary points.) The second-order condition is equivalent to the condition
Noting that p = 𝗭v is a null-space vector, we can rewrite (107) as
that is, the Hessian matrix at X* must be positive semidefinite on the null space of 𝗔.
c. Application of second-order-adjoint technique to obtain exact Hessian/vector product
We will exemplify this application by considering a truncated-Newton algorithm for large-scale minimization.
1) Description of Truncated-Newton methods
Truncated-Newton methods are used to solve the problem
using a conjugate-gradient (CG) iterative method; we note here that Newton equations are a linear system of the form
The conjugate gradient method is “truncated” before the exact solution to the Newton equations has been found. The CG method computes the search direction and requires storage of a few vectors.
The only obstacle for using minimization is the requirement that it computes Hessian matrix/vector products of the type
for arbitrary vectors v. One way to bypass the storage difficulty is to approximate the Hessian matrix/vector products using values of the gradient in such a way that the Hessian matrix need not be computed or stored. Using Taylor series
that is, we approximate the matrix/vector product
for some small values of h.
The task of choosing an adequate h is an arduous one (see Nash and Sofer 1996, chapter 11.4.1 and references therein). For in-depth descriptions of the truncated-Newton (also referred to as the Hessian-free) method, see Nash (1984a–d, 1985) and Nash and Sofer (1989a,b), as well as Schlick and Fogelson (1992a,b), and early work by Dembo et al. (1982) and Dembo and Steihaug (1983). A comparison of limited memory quasi-Newton (see Liu and Nocedal 1989) and truncated-Newton methods is provided by Nash and Nocedal (1991), while a comprehensive well-written survey of truncated-Newton methods is presented in Nash (2000). A comparison between limited memory quasi-Newton and truncated-Newton methods applied to a meteorological problem is described in depth by Zou et al. (1990, 1993).
d. A method for estimating the Hessian matrix
The cost function measuring the misfit between the forecast model solution and available observations distributed in space and time may be expressed as
For the sake of simplicity we choose R = 1, which yields
where 𝗕 is an observation operator, X(tr) the vector of model control variables, Xobs(tr) the vector of observational data at time t = tr, t = tN is the final time of model integration, and W(tr) is the inverse of the observation covariance matrix.
The operator of model integration from time t = t0 to t = tN is
At the minimum Xmin, the gradient of the cost function ∇J vanishes.
If we introduce random variables η(t0) and η(tN), with zero expectations and whose covariances are the diagonal elements of 𝗪−1(t0) and 𝗪−1(tN), respectively, to the observations
then ∇J(at X*min) is a random variable and we obtain
where E[·] is the mathematical expectation and 𝗝″ is the Hessian matrix. We can see that we obtain an outer vector product expression, which is a rank-one matrix.
For each realization i of Xobs1(t0) and Xobs1(tN) we can calculate ∇Ji at Xmin and after p such realizations we obtain at most a rank p approximation of the Hessian of the cost function (Yang et al. 1996; Rabier and Courtier 1992; Courtier et al. 1994):
This approach is analogous to quasi-Newton methods where symmetric rank-one or rank-two updates are collected to update approximations of the Hessian or the inverse of the Hessian matrix as the minimization proceeds. As shown by Yang et al. (1996) use of the approximate 𝗝″p as a preconditioner is extremely efficient. Forsythe and Strauss (1955) have already shown that using the diagonal of the Hessian is optimal among all diagonal preconditioning methods.
6. Second-order adjoint via automatic differentiation
There is an increased interest in obtaining the second-order adjoint via automatic differentiation (AD).
Research work has been carried out in the recent version of the FORTRAN TAMC AD package designed by Giering and Kaminski (1998a) allowing for both the calculation of Hessian/vector products as well as for the more computationally expensive derivation of the full Hessian with respect to the control variables. Comparable CPU times to those required for hand coding were reported (Giering and Kaminski 1998b).
The importance of the Hessian/vector products derived by AD is particularly important in minimization where there is often interest not only in the first but rather in the second derivatives of the cost functional, which convey crucial information.
Griewank (2000) estimated the computational complexity of implementing second-order adjoints in a thorough manner.
He found that for calculating Hessian/vector products an effort leading to a run-time ratio of about a factor of 13 was required.
The calculation of the ratio between the effort required to obtain Hessian/vector products and that required to calculate the gradient of the cost was found to be a factor between 2 and 3 only.
All analytic differentiation methods are based on the observation that most vector functions f are being evaluated as a sequence of assignments:
Here variables υi are real scalars and the elemental functions ϕi are either binary arithmetic operations or univariate intrinsics.
Consequently, only one or two of the partial derivatives,
do not vanish identically.
Without loss of generality we may require that the first n variables
represent independent variables and that the last m variables
represent dependent variables.
In AD calculation of f′(x), it can be represented by a sparse triangular matrix
This 𝗖 can also be interpreted as a computational graph whose vertices are elemental functions ϕi and the edges are the nonvanishing partial cij.
Exploiting sparsity for AD calculation of the second-order adjoint Griewank (2000) shows that economy can be realized when the computational Hessian graph symmetry allows the AD-computed Hessian to assume the form
which leads to a dyadic representation first put forward in a paper by Jackson and McCormick (1988); see below Eqs. (125)–(126).
The derivation originates in the approach put forward by Griewank (2001) of Jacobian accumulation procedures using implicit function theorem.
Griewank (2000) derives a class of derivative accumulation procedures as edge eliminations on the linearized computational Hessian graph.
Functions defined by an evaluation procedure can be characterized by a triangular system of nonlinear equations E(X, υ) = 0. Applying the implicit function theorem one obtains the derivative of ∂υ/∂x. The (n + i)-th row of ∂υ/∂x represents exactly the gradient
of the ith intermediate value υi with respect to all independent variables:
denotes the matrix formed by the gradients
of all intermediates υi, with respect to the independents x ∈ ℝn, which is computed during a forward sweep to calculate the gradient
Then the Hessian takes the product form
where 𝗦 ∈ ℝ(l−m+n)×(l−m+n) is zero except for diagonal elements of the form υjϕj and diagonal blocks of the form
(ϕj being a nonlinear elemental).
One can show that first-order derivatives form the nonzero elements of matrix Ż.
results in a sum of outer products called dyadic representation, which was used extensively by Jackson and McCormick (1988), who referred to the functions defined by the evolution procedures as “factorable.”
υi are intermediates with respect to independents x ∈ ℝn.
7. Use of Hessian of cost functional to estimate error covariance matrices
A relationship exists between the inverse Hessian matrix and the analysis error covariance matrix of either 3DVAR or 4DVAR (see Thacker 1989; Thepaut and Courtier 1991; Rabier and Courtier 1992; Yang et al. 1996; Le Dimet et al. 1997).
Following Courtier et al. (1994) we consider methods for estimating the Hessian in the weakly nonlinear problem when the tangent linear dynamics is a good approximation to nonlinear dynamics. As a consequence the cost function is near to being quadratic. If as in Gauthier (1992) we consider the observations as random variables and we look at variational analysis as attempting to solve the minimization problem
where xb is the unbiased background field and y the set of unbiased observations, both being realizations of random variables of covariances 𝗕 and 𝗢, respectively, and where the operator 𝗛 computes the model equivalent ℋx of the observation y, then the Hessian 𝗝″ of the cost function J at the minimum is given by
obtained by differentiating (128) twice.
Moreover the analysis error covariance matrix is the inverse of the Hessian as shown in appendix B of Rabier and Courtier (1992). Calling xa the result of the minimization (i.e., the analysis) and xt the truth, one sees that the error covariance at the minimum is
8. Hessian singular vectors
Computing Hessian singular vectors (HSVs) uses the full Hessian of the cost function in the variational data assimilation, which can be viewed as an approximation of the inverse of the analysis error covariance matrix and it is used at initial time to define a norm. The total energy norm is still used at optimization time. See work by Barkmeijer et al. (1998, 1999). The HSVs are consistent with the 3DVAR estimates of the analysis error statistics. They are also defined in the context of 4DVAR. In practice one never knows the full 3DVAR Hessian in its matrix form and a generalized eigenvalue problem is solved as will be described below.
The HSVs are also used in a method first proposed by Courtier (1993) and tested by Rabier et al. (1997) for the development of a simplified Kalman filter fully described by Fisher (1998) and compared with a low-resolution explicit extended Kalman filter by Ehrendorfer and Bouttier (1998).
Let 𝗠 be the propagator of the tangent linear model, and 𝗣 a projection operator setting a vector to zero outside a given domain.
Consider positive-definite and symmetric operators including a norm at initial and optimization time, respectively.
Then the HSVs defined by
under an Euclidean norm are the solution of the generalization eigenvalue problem. Here the positive-definite and symmetric operators 𝗖 and 𝗘 induce a norm at initial and optimization time, respectively. Usually the total energy metric is used and then 𝗖 and 𝗘 are identical:
The adjoint operators 𝗠* and 𝗣* are determined with respect to the Euclidean inner product.
In HSVs the operator 𝗖 is equal to the Hessian of the 3D-/4DVAR cost function.
The operator 𝗖 is specified to be equal to the full Hessian of the 3DVAR cost function. While 𝗖 is not known in matrix form and determining its square root is not feasible, Barkmeijer et al. (1999) show that one can solve (132) by a generalized eigenvalue problem solver called the generalized Davidson algorithm (see Barkmeijer et al. 1998; Davidson 1975). See also Sleijpen and van der Vorst (1996). Using
and carrying out a coordinate transformation,
Then we have a transformed operator
and the Hessian becomes equal to the sum of identity and a matrix with rank less than or equal to the dimensions of the vector of observations (see Fisher and Courtier 1995).
The reduced-rank Kalman filter requires as input pairs of vectors that satisfy
where 𝗣f is a flow-dependent approximation to the covariance matrix of the background error. Such pairs of vectors can be calculated during the course of Hessian singular vector calculation (see Fisher 1998).
In this calculation, in which the inner product at the initial time is defined by the Hessian matrix of an analysis cost function, the vectors sk are partially evolved singular vectors, while the vectors zk are produced during the adjoint model integration.
Veerse (1999) proposes to take advantage of this form of the appropriate Hessian in order to obtain approximations of the inverse analysis error covariance matrix, using the limited memory inverse Broyden, Fletcher, Goldfarb, and Shanno (BFGS) minimization algorithm.
Let 𝗛 be (∇2J)−1 the inverse Hessian and 𝗛+ the updated version of the inverse Hessian, and
where s is the difference between the new iterate and the previous one in a limited memory quasi-Newton minimization procedure and
is the corresponding gradient increment.
One has the formula
where 〈, 〉 is a scalar product with respect to which the gradient is defined and ⊗ stands for the outer product.
Many minimization methods are implemented by using the inverse Hessian matrix/vector product that is built into the minimization code, such as Nocedal's algorithm (Nocedal 1980). These methods are useful when the second-order adjoint is not available due to either memory or CPU limitations.
9. Numerical experiments: Application of AD Hessian/vector products to the truncated Newton algorithm
For the numerical experiments we consider the truncated-Newton algorithm to minimize the cost function (59) associated with the SWE model (56)–(58). The spatial domain considered is a 6000 km × 4400 km channel with a uniform 21 × 21 spatial grid, such that the dimension of the initial condition vector (u, υ, ϕ)T is 1083, and the Hessian of the cost function is a 1083 × 1083 matrix.
The initial conditions are those of Grammeltvedt (1969). As for the boundary conditions, on the southern and northern boundaries the normal velocity components are set to zero, while periodic boundary conditions are assumed in the west–east direction. Integration is performed with a time increment Δt = 600s and the length of the assimilation window is 10 h. Data assimilation is implemented in a twin experiments framework such that the value of the cost function at the minimum point must be zero. As the set of control parameters, we consider the initial conditions that are perturbed with random values chosen from a uniform distribution.
The second-order adjoint model was generated using TAMC (Giering and Kaminski 1998a). The correctness of the adjoint-generated routines was checked using the small perturbations technique. Assuming that the cost function J(X) is evaluated by the subroutine model (J, X), computation of the Hessian/vector products 𝗚(X)u via automatic differentiation is performed in two steps. First the reverse (adjoint) mode is applied to generate the adjoint model. Next, the tangent (forward) mode is applied to the adjoint model to generate the second-order adjoint (SOA) model. The performance of the minimization process using AD SOA is analyzed versus an approximate Hessian/vector product computation given by (116), with a hand-coded adjoint model implementation. The absolute and relative differences between the computed Hessian/vector product at the first iteration (initial guess state) are shown in Fig. 1 for the first 100 components. The first-order finite-difference method (FD) provides in average an accuracy of two to three significant digits. The optimization process using FD stops after 28 iterations when the line search fails to find an acceptable step size along the search direction, whereas for the SOA method a relative reduction in the cost function up to the machine precision is reached at iteration 29. The evolution of the normalized cost function and gradient norm are presented in Figs. 2 and 3, respectively.
The computational cost is of same order of magnitude for both the finite-difference approach and the exact second-order adjoint approach. The second-order adjoint approach requires integrating the original nonlinear model and its TLM forward in time and integrating the first-order adjoint model and second-order adjoint model backward in time once. The average ratio of the CPU time required to compute the gradient of the cost function to the CPU time used in evaluating the cost function was CPU(∇J)/CPU(J) ≈ 3.7. If we assume that the value of the gradient ∇J(X) in (116) is already available (previously computed in the minimization algorithm), only one additional gradient evaluation ∇J(X + hu) is needed in (116) to evaluate the Hessian/vector product using the FD method. In this case, we then have an average ratio to compute the Hessian/vector product CPU(𝗚u)FD/CPU(J) ≈ 3.7. Using the SOA method to compute the exact Hessian/vector product, we obtained an average CPU(𝗚u)SOA/CPU(J) ≈ 9.4, in agreement with the estimate (A4) in the appendix. We notice that in addition to the Hessian/vector product the AD SOA implementation also provides the value of the gradient of the cost function. The average ratio CPU(𝗚u)SOA/CPU(∇J) ≈ 2.5 we obtained is also in agreement with the CPU estimate (A2) in the appendix.
a. Numerical calculation of Hessian eigenvalues
Iterative methods and the SOA model may be combined to obtain information about the spectrum of the Hessian matrix of the cost function. In this application we used the ARPACK package (Lehoucq et al. 1998) to compute five of the largest and smallest eigenvalues of the Hessian matrix. The method used is the implicitly restarted Arnoldi method (IRAM), which reduces to the implicitly restarted Lanczos method (IRLM) since 𝗚 is symmetric. For our application, only the action of the Hessian matrix on a vector is needed and we provide this routine using the SOA model. The condition number is evaluated as
The computed Ritz values and the relative residuals are included in Table 1 for the Hessian evaluated at the initial guess point, and in Table 2 for the Hessian evaluated at the optimal point X*. For our test example the eigenvalues of the Hessian are positive, such that the Hessian is positive definite and the existence of a minimum point is assured. The condition number of the Hessian is of order k(𝗚) ∼ 104, which explains the slow convergence of the minimization process.
Use of the Hessian of cost function eigenvalue information in regularization of ill-posed problems was illustrated by Alekseev and Navon (2001, 2002). The application consisted of a wavelet regularization approach for dealing with an ill-posed problem of adjoint parameter estimation applied to estimating inflow parameters from downflow data in an inverse convection case applied to the two-dimensional parabolized Navier–Stokes equations. The wavelet method provided a decomposition into two subspaces, by identifying both a well-posed as well as an ill-posed subspace, the scale of which was determined by finding the minimal eigenvalues of the Hessian of a cost functional measuring the lack of fit between model prediction and observed parameters. The control space is transformed into a wavelet space. The Hessian of the cost was obtained either by a discrete differentiation of the gradients of the cost derived from the first-order adjoint or by using the full second-order adjoint. The minimum eigenvalues of the Hessian are obtained either by employing a shifted iteration method following Zou et al. (1992) or by using the Rayleigh quotient. The numerical results obtained illustrated the usefulness and applicability of this algorithm if the Hessian minimal eigenvalue is greater or equal to the square of the data error dispersion, in which case the problem can be considered to be well-posed (i.e., regularized). If the regularization fails, that is, the minimal Hessian eigenvalue is less than the square of the data error dispersion of the problem, the following wavelet scale should be neglected, followed by another algorithm iteration.
10. Summary and conclusions
The recent development of variational methods in operational meteorological centers (European Centre for Medium-Range Weather Forecasts, Météo-France) has demonstrated the strong potential of these methods.
Variational techniques require the development of powerful tools such as the adjoint model, which are useful for the adjustment of the inputs of the model (initial and/or boundary conditions). From the mathematical point of view the first-order adjoint will provide only necessary conditions for an optimal solution. The second-order analysis goes one step further and provides information that is essential for many applications:
(i) Sensitivity analysis should be derived from a second-order analysis, that is, from the derivation of the optimality system. This is made crystal clear when sensitivity with respect to observations is required. In the analysis, observations appear only as a forcing term in the adjoint model; therefore, in order to estimate the impact of observations this is the system that should be derived.
(ii) Second-order information will improve the convergence of the optimization methods, which are the basic algorithmic components of variational analysis.
(iii) The second-order system permits estimating the covariances of the fields. This information is essential for the estimation of the impact of errors on the prediction.
The computational cost to be paid in order to obtain the second-order adjoint system is twofold: 1) We have to consider the computational cost for the derivation of the SOA. It has been seen that we can get it directly from the linear tangent model and from the adjoint model. Only the right-hand sides should be modified. 2) Computing the second-order information. Basically the first-order information has the same dimension as the input of the model. Let n be this dimension. The second-order information will be represented by an n × n matrix. For operational models the computation of the full Hessian matrix is prohibitive; nevertheless, it is possible to extract the most useful information (eigenvalues and eigenvectors, spectrum, condition number, etc.) at a reasonable computational cost.
The numerical results obtained illustrate the ease with which present-day automatic differentiation packages allow one to obtain second-order adjoint models as well as Hessian/vector products. They also confirm numerically the CPU estimates for computational complexity as derived in section 7 (see also Griewank 2000).
Numerical calculation of the leading eigenvalues of the Hessian along with its smallest eigenvalues yields results similar to those obtained by Wang et al. (1998) and allows for valuable insight into the Hessian spectrum, thus allowing us to deduct the important information related to condition number of the Hessian and, hence, to the expected rate of convergence of minimization algorithms.
With the advent of ever more powerful computers, the use of second-order information in data assimilation will be within realistic reach for 3D models and is expected to become more prevalent.
The purpose of this paper has been to demonstrate the importance of new developments in second-order analysis: many directions of research remain open in this domain.
The authors wish to thank two anonymous reviewers whose constructive comments led to a marked improvement in the presentation of the present paper.
The second author would like to acknowledge support from the NSF through Grant ATM-9731472, managed by Dr. Pamela Stephens, whom we would like to thank for her support.
IDOPT is a joint project CNRS–INRIH–Université Joseph Fourier INPG.
Computational Complexity of AD Calculation of the Second-Order Adjoint
Griewank (2000) starts by working out a representation of the complexity measure as a task consisting of moves, adds, multiplications, and nonlinear operations, thus obtaining a representation of work (task) as:
Then run time can be written as
Here w is a vector of positive weights that depend on the computing system and represent the number of clock cycles needed for fetching and/or storing data items, multiplication, addition, and finally for taking into account nonlinear operations.
Usually the vector wT assumes the form
and for most computing platforms μ ≥ max(1, π/2), π ≤ 1 and ν ≤ 2π. For example, this assumption implies that a memory access (μ) is at least as slow as an addition or half a multiplication (π). Griewank (2000) derives the computational complexity of the tangent model (directional derivative) wtang, gradient (first order adjoint) wgrad, and second-order adjoint wSOA normalized by the complexity of the model evaluation as
As mentioned by Nocedal and Wright (1999) automatic differentiation has increasingly been using more sophisticated techniques that allow, when used in reverse mode, for calculation either of full Hessians or Hessian/vector products. However the automatic differentiation technique should not be regarded as a substitute for the user to think that this is a fail-safe product and each derivative calculation obtained with AD should be carefully assessed.
Gay (1996) has shown how to use partial separability of the Hessian in AD while Powell and Toint (1979) and Coleman and More (1984), along with Coleman and Cai (1986), have shown how to estimate a sparse Hessian using either graph-coloring techniques or other highly effective schemes.
Corresponding author address: Dr. I. M. Navon, School of Computational Science and Information Technology, The Florida State University, Tallahassee, FL 32306-4120. Email: email@example.com
* This work is dedicated to the late Professor Jaques Louis Lions in deep homage.