## 1. Introduction

Data assimilation can be described as the ensemble of techniques for retrieving geophysical fields from different sources such as observations, governing equations, statistics, etc.

Being heterogeneous in nature, quality, and density, these data sources have to be combined in order to retrieve the geophysical fields optimally (the meaning of “optimal” has to be precisely defined). Due to its inherent operational tasks, meteorology has played an important role in the development of data assimilation techniques. An ever-increasing amount of data and models has to be considered as an ensemble from which the optimal information should be extracted. Behind most of the classical methods used in meteorology, such as optimal interpolation, variational methods, and statistical estimation, there is a variational principle; that is, the retrieved fields are obtained through the minimization of some functional depending on the various sources of information. The retrieved fields are obtained through some optimality condition, which can be an Euler or Euler–Lagrange condition if the regularity conditions are satisfied. Since these conditions are first-order conditions, they involve the first-order derivatives of the functional that is minimized; in this sense, data assimilation techniques are first-order methods. But first-order conditions are only necessary conditions for optimality, not sufficient ones. To obtain sufficient conditions we need to proceed one step further and introduce second-order information. By the same token, from the mathematical point of view, the sensitivity with respect to some parameter can be obtained through Gâteaux derivatives with respect to this parameter. Therefore if we seek the sensitivity of fields that have already been defined through some first-order conditions, we have to proceed one order of derivation further, and in this sense sensitivity studies also require second-order information.

The purpose of this review paper is to show how to obtain second-order information in data assimilation and how to use it efficiently. In the first part we show how the second-order derivative can be computed, first in a very general framework, and then illustrate it with some examples. We then show how this second-order information is linked to the issue of uniqueness of the solution of the data assimilation problem. This is not only a mathematical consideration but also a practical issue, since useful information can be extracted by studying second-order quantities.

In a second part of the paper we will proceed to show how to derive sensitivity analyses from models and data. The analysis of the impact of uncertainties in the model and in the data provides essential links between purely deterministic methods (such as variational data assimilation) and stochastic methods (Kalman filter–type). We will then proceed to demonstrate how the link between these methods can be clearly understood through use of second-order information.

Researchers in other disciplines have carried out pioneering work using second-order information. Work in seismology using second-order information and applying it to obtain accurate Hessian/vector products for truncated-Newton minimization was carried out by Santosa and Symes (1988, 1989) and by Symes (1990, 1991, 1993). Reuther (1996) and Arian and Ta'asan (1999) illustrated the importance of second-order adjoint analysis for optimal control and shape optimization for inviscid aerodynamics. Hou and Sheen (1993) used second-order sensitivity analysis for heat conduction problems.

Second-order information was tackled in automatic differentiation (AD) by Abate et al. (1997), Giering and Kaminski (1998a,b), Gay (1996), Hovland (1995), Bischof (1995), Burger et al. (1992), Griewank and Corliss (1991), and Griewank (1993, 2000, 2001), to cite but a few. Several AD packages such as the tangent linear and adjoint model compiler (TAMC) of Giering and Kaminski (1998a) allow calculation of the Hessian of the cost functional.

Early work on second-order information in meteorology includes Thacker (1989) followed by work of Wang et al. (1992, 1993) and Wang (1993). Wang et al. (1995, 1998) considered use of second-order information for optimization purposes namely to obtain truncated-Newton and adjoint Newton algorithms using exact Hessian/vector products. Application of these ideas was presented in Wang et al. (1997).

Kalnay et al. (2000) introduced an elegant and novel pseudo-inverse approach and showed its connection to the adjoint Newton algorithm of Wang et al. (1997) (see Kalnay et al. 2000; Pu and Kalnay 1999; Pu et al. 1997).

Ngodock (1996) applied second-order information in conjunction with sensitivity analysis in the presence of observations and applied it to the ocean circulation. Le Dimet et al. (1997) presented the basic theory for second-order adjoint analysis related to sensitivity analysis. A condensed summary of the theory is presented in Le Dimet and Charpentier (1998).

The structure of the paper is as follows. Section 2 deals with the theory of the second-order adjoint method, both for time-independent and time-dependent models. The methodology is briefly illustrated using the shallow-water equations model. Section 3 deals with the connection between sensitivity analysis and second-order information. Section 4 briefly presents the Kalnay et al. (2000) quasi-inverse method and its connection with second-order information. Issues related to second-order Hessian information in optimization theory are addressed in section 5. Both unconstrained and constrained minimization issues are briefly discussed. Finally the use of accurate Hessian/vector products to improve the performance of the truncated-Newton method is presented, along with the adjoint truncated-Newton method. A method for approximating the Hessian of the cost function with respect to the control variables proposed by Courtier et al. (1994), based on a rank-*p* approximation and bearing similarity to the Hessian approximations of quasi-Newton methods (see Davidon 1959, 1991), is presented in section 5d.

Section 6 is dedicated to methods of obtaining the second-order adjoint via AD technology, while issues of computational complexity of AD for the second-order adjoint are presented in the appendix. Use of the Hessian of the cost functional to estimate error covariance matrices is briefly discussed in section 7. The use of Hessian singular vectors used for development of a simplified Kalman filter is addressed briefly in section 8.

Finally as a numerical illustration we present in section 9 the application of the second-order adjoint of a limited area model of the shallow-water equations to obtain an accurate Hessian/vector product compared to an approximate Hessian vector product obtained by finite differences. Automatic differentiation is implemented using the adjoint model compiler TAMC. The Hessian/vector information is used in a truncated-Newton minimization algorithm of the cost functional with respect to the initial conditions taken as the control variables and its impact versus the Hessian/vector product obtained via finite differences is assessed. The numerical results obtained verify the theoretically derived computational cost of obtaining the second-order adjoint via automatic differentiation. The Arnoldi package (ARPACK) was then used in conjunction with the second-order adjoint to gain information about the spectrum of the Hessian of the cost function. The unified notation of Ide et al. (1997) for data assimilation will be used.

Summary and conclusions are finally presented in section 10.

## 2. Computing the second-order information

In this section we will deal with deterministic models while the case of stochastic modeling will be discussed later in this manuscript.

The model is written in the generic form

*F*(**X**, **U**) = **0**,  (1)

where **X**, the state variable, describes the state of the environment, and **U** is the input of the model, that is, an initial condition that has to be provided to the model to obtain from Eq. (1) a unique solution **X**(**U**). We will assume that **X** and **U** belong to spaces equipped with an inner product.

Let *J*(**X**, **U**) be a functional of **X** and **U**; *J* may be viewed as representing the cost function measuring the discrepancy between the observations and the solution associated with the value **U** of the input parameter. Therefore the optimal input for the model will minimize *J.*

### a. First-order necessary conditions

If **U** minimizes *J,* then it satisfies the Euler equation

∇*J*(**U**) = **0**,

where ∇*J* is the gradient of *J* with respect to the control variables.

The gradient of *J* is obtained in the following way:

Let *Z* be an application from ℝ^{n} into ℝ^{n} with variable **U**. We define the Gâteaux derivative of *Z* in the direction **u** as

*Ẑ*(**U**, **u**) = lim_{α→0} [*Z*(**U** + *α***u**) − *Z*(**U**)]/*α,*

when this limit exists, where (ˆ) stands for the Gâteaux derivative. If *Ẑ*(**U**, **u**) is linear in **u**, we can write

*Ẑ*(**U**, **u**) = 〈∇*Z*(**U**), **u**〉,

where ∇*Z* is the gradient of *Z* with respect to **U**, and 〈 · , · 〉 stands for the inner product. The Gâteaux derivative is also called a directional derivative.

Taking the Gâteaux derivative of *F* in some direction **u** gives the tangent linear system

(∂*F*/∂**X**)**X̂** + (∂*F*/∂**U**)**u** = **0**,

where ∂*F*/∂**X** (or ∂*F*/∂**U**) is the Jacobian of *F* with respect to **X** (or **U**).

The gradient of *J* is obtained by exhibiting the linear dependence of *Ĵ* with respect to **u**. This is done by introducing the adjoint variable **P** (to be defined later according to convenience).
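The first-order adjoint machinery can be illustrated with a minimal numerical sketch, not taken from the paper: assume (purely for illustration) a linear steady model *F*(**X**, **U**) = 𝗔**X** − **U** = **0** with an invented operator 𝗔 and cost *J*(**U**) = ½‖**X**(**U**) − **X**_{obs}‖². A single adjoint solve 𝗔^{T}**P** = **X** − **X**_{obs} then yields the full gradient ∇*J* = **P**, which can be checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned toy model operator
X_obs = rng.standard_normal(n)
U = rng.standard_normal(n)

def solve_state(U):
    # F(X, U) = A X - U = 0  ->  X(U) = A^{-1} U
    return np.linalg.solve(A, U)

def cost(U):
    X = solve_state(U)
    return 0.5 * np.dot(X - X_obs, X - X_obs)

# Adjoint step: solve A^T P = X - X_obs; then grad J = P,
# since J-hat = <X - X_obs, X-hat> = <P, u> with A X-hat = u.
X = solve_state(U)
P = np.linalg.solve(A.T, X - X_obs)
grad_adjoint = P

# Consistency check against one-sided finite differences
eps = 1e-6
grad_fd = np.array([(cost(U + eps * e) - cost(U)) / eps for e in np.eye(n)])
```

For a nonlinear model the same two-step pattern holds, with 𝗔 replaced by the Jacobian ∂*F*/∂**X** evaluated on the current trajectory.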

### b. Second-order adjoint

We now seek the product of the Hessian ∇^{2}*J*(**U**) of *J* with some vector **u**. As before we apply a perturbation **u** to Eqs. (1) and (9), and from Eqs. (9) and (10) we obtain a coupled system for the Gâteaux derivatives **X̂** and **P̂**. We introduce **Q** and **R**, two additional variables. To eliminate **X̂** and **P̂**, we take the inner product of Eqs. (4) and (11) with **Q** and **R**, respectively; then we add the results and exhibit the linear dependence with respect to **u**. If **Q** and **R** are defined as being the solution of the system (16)–(17), then we obtain the Hessian/vector product (18). For Eqs. (14)–(15) we took into account the symmetry of the second derivative, which leads to some simplifications.

The system (16)–(17) will be called the second-order adjoint. Therefore we can obtain the product of the Hessian by a vector **u** by (i) solving the system (16)–(17) and (ii) applying formula (18).

### c. Remarks

The system (16)–(17), which has to be solved to obtain the Hessian/vector product, can be derived from the Gâteaux derivative (4), which is the same as (17). In the literature, this system is often called the tangent linear model, a denomination that is rather inappropriate because it suggests a linearization, with the attendant notion of a range of validity, which is not relevant in the case of a derivative.

In the case of an *N*-dimensional control space, the full Hessian can be computed after *N* integrations of the system (16)–(17), taking **u** successively equal to the vectors **e**_{i} of the canonical base.

Equation (16) differs from the adjoint model by the forcing terms, which will depend on **u** and **R**.

An approximation of the Hessian/vector product can also be obtained by differencing two gradients,

∇^{2}*J*(**U**)**u** ≈ [∇*J*(**U** + *α***u**) − ∇*J*(**U**)]/*α,*

where *α* is the finite-difference interval that has to be judiciously chosen. In the incremental three- and four-dimensional variational data assimilation (3D-/4DVAR) approach the Hessian/vector product can readily be obtained in this way.

However several integrations of the model and of its adjoint model will be necessary in this case to determine the range of validity of the finite-difference approximation (Wang 1995 and references therein).
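Continuing the toy linear model introduced earlier (an illustrative sketch, not the paper's computation): for a linear model the second-derivative forcing terms vanish, so the second-order adjoint procedure reduces to one tangent linear solve followed by one adjoint solve per Hessian/vector product, and the result can be compared with the finite-difference formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # toy linear model operator
X_obs = rng.standard_normal(n)
U = rng.standard_normal(n)

def gradient(U):
    # First-order adjoint: A^T P = X - X_obs, grad J = P
    X = np.linalg.solve(A, U)
    return np.linalg.solve(A.T, X - X_obs)

def hessian_vector(u):
    # (i) tangent linear step: A X-hat = u
    X_hat = np.linalg.solve(A, u)
    # (ii) second-order adjoint step: A^T R = X-hat
    #      (the second-derivative forcing of F vanishes for a linear model)
    return np.linalg.solve(A.T, X_hat)

u = rng.standard_normal(n)
Hu_exact = hessian_vector(u)

# Finite-difference approximation of the same product
alpha = 1e-6
Hu_fd = (gradient(U + alpha * u) - gradient(U)) / alpha
```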

### d. Time-dependent model

Let the model be defined on the time interval [0, *T*]. The evolution of **X** between 0 and *T* is governed by the differential system

d**X**/d*t* = *F*(**X**) + 𝗕**V**,  **X**(0) = **U**.

Here **X** belongs to the Hilbert space ℍ, which is a subspace of the *n*-dimensional Cartesian product space [𝔸(0, *T*)]^{n}, and the solution **X**(**U**, **V**) lies in [𝔸(0, *T*)]^{n}; **V** ∈ [𝔸(0, *T*)]^{m} is a term used to represent the uncertainties of the model, which we assume to be linearly coupled through the (*n,* *m*)-dimensional matrix 𝗕; **U** is the initial condition. The criterion *J* measures the discrepancy between the solution of (20)–(21) and the observations,

*J*(**U**, **V**) = ½ ∫_{0}^{T} ‖**H**​**X** − **X**_{obs}‖^{2} d*t,*

where **H** is the observation matrix, that is, a linear operator mapping **X** into the space of the observations **X**_{obs}. The problem consists of determining **U** and **V** that minimize *J.*

Applying perturbations **v** on **V** and **u** on **U** gives **X̂** and *Ĵ,* the Gâteaux derivatives of **X** and *J,* as the solution of the tangent linear system (23)–(25). We then introduce **P**, the adjoint variable; we take the product of (23) with **P**, perform a summation on the interval [0, *T*] and an integration by parts, and identify the linearities with respect to **u** and **v** in (25). We conclude that if **P** is defined as the solution of the adjoint model

d**P**/d*t* + (∂*F*/∂**X**)^{T}**P** = **H**^{T}(**H**​**X** − **X**_{obs}),  **P**(*T*) = **0**,

then the components of the gradient ∇*J* with respect to **U** and **V** are

∇_{U}*J* = −**P**(0),  ∇_{V}*J* = −𝗕^{T}**P**.

Because **V** is time dependent, its associated adjoint variable **Q** will also be time dependent. Let us remark that the gradient of *J* with respect to **V** depends on time, which is not surprising since **V** itself is a function of time. From a computational point of view the discretization of **V** will have to be carried out in such a way that the discretized variable remains in a space of “reasonable” dimension.
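A discrete-time analog of this adjoint computation can be sketched as follows (all operators M, B, H and the observations y are invented stand-ins, not the paper's model): the forward sweep stores the trajectory, and a single backward adjoint sweep returns both the gradient with respect to the initial condition and the time-dependent gradient with respect to the model-error term:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 4, 2, 6
M = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))   # model propagator
B = rng.standard_normal((n, m))                             # model-error coupling
H = np.eye(n)                                               # observation operator
y = rng.standard_normal((N + 1, n))                         # observations

def forward(U, V):
    X = [U]
    for k in range(N):
        X.append(M @ X[k] + B @ V[k])
    return X

def cost(U, V):
    X = forward(U, V)
    return 0.5 * sum((H @ X[k] - y[k]) @ (H @ X[k] - y[k]) for k in range(N + 1))

def gradients(U, V):
    # Backward (adjoint) sweep: p_k = H^T(H x_k - y_k) + M^T p_{k+1}
    X = forward(U, V)
    p = H.T @ (H @ X[N] - y[N])
    gV = np.zeros((N, m))
    for k in range(N - 1, -1, -1):
        gV[k] = B.T @ p                       # gradient w.r.t. v_k uses p_{k+1}
        p = H.T @ (H @ X[k] - y[k]) + M.T @ p
    return p, gV                              # p is now p_0, the gradient w.r.t. U

U = rng.standard_normal(n)
V = rng.standard_normal((N, m))
gU, gV = gradients(U, V)

# finite-difference check on the gradient w.r.t. U
eps = 1e-6
gU_fd = np.array([(cost(U + eps * e, V) - cost(U, V)) / eps for e in np.eye(n)])
```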

To compute the second-order information we apply a perturbation **H** to the control variables **U** and **V**. The Gâteaux derivatives **X̂** and **P̂** of **X** and **P** in the direction of **H** are obtained as the solution of the coupled system (32)–(33).

We then introduce **Q** and **R**, the second-order adjoint variables; they will be defined later for convenience of presentation.

Taking the inner product of (32) with **Q** and of (33) with **R**, integrating from 0 to *T,* then adding the resulting equations, we collect the terms in **P̂** and **X̂**; after integration by parts and some additional transformations we obtain the second-order adjoint system (39)–(42), together with the expressions (43)–(44) for the product of the Hessian of *J* with the perturbation.

We would like to point out that Eq. (44) follows directly from Eq. (43) by using Eq. (42). The product of the Hessian by a vector is obtained exactly by a direct integration of (40) and (42) followed by a backward integration in time of (39) and (41).

The full Hessian may be obtained as follows. One performs *n* integrations of the second-order adjoint system, with conditions given by **u** = **e**_{i}, where **e**_{i} are the *n* vectors of the canonical base of ℝ^{n}, thus obtaining the columns of ∇^{2}_{UU}*J.* One then integrates the system *m* times, with terminal and, respectively, initial conditions given by **v** = **f**_{j}, where **f**_{j} are the *m* canonical base vectors of ℝ^{m}, obtaining the columns ∇^{2}_{VV}*J*​**f**_{j} in terms of 𝗕^{T}**Q**. In total the full Hessian therefore requires *n* + *m* integrations of the second-order adjoint. The second-order adjoint is easily obtained from the first-order adjoint, differing from it only by some forcing terms, in particular the second-order term; the second equation of the system is that of the linearized model (the tangent linear model).

One can also obtain the product of a vector of the control space, times the Hessian at the cost of a single integration of the second-order adjoint.

### e. Example: The shallow-water equations

The shallow-water equations (SWEs) represent the flow of an incompressible fluid whose depth is small with respect to the horizontal dimension.

Here **X** = (*u,* *υ,* *ϕ*)^{T} is the state variable, *u* and *υ* are the components of the horizontal velocity, *ϕ* is the geopotential, and *f* is the Coriolis parameter. We aim to present this example as a didactic setup, and thus we make the strongest simplifications:

- We neglect the model error, which following the previous notation, implies 𝗕 ≡ 0. We only control the initial conditions.
- We impose periodic boundary conditions.
- The observations are assumed continuous in both space and time, which is tantamount to assuming **H** ≡ 𝗜, where 𝗜 is the identity operator. Let **U**_{0} = (*u*_{0}, *υ*_{0}, *ϕ*_{0})^{T} be the initial condition; the cost function then assumes the form of a weighted least squares misfit to the observations, where *γ* is a nonunit weighting term.

Let **X̂** = (*û,* *υ̂,* *ϕ̂*)^{T} denote the directional derivative of **X** in the direction of the perturbation **h** = (**h**_{u}, **h**_{υ}, **h**_{ϕ})^{T} applied to the initial condition; it satisfies the tangent linear model.

Let **P** = (*ũ,* *υ̃,* *ϕ̃*)^{T} be the adjoint variable; it satisfies the adjoint model.

We then introduce the second-order adjoint variable **Q** and the variable **R** defined by the TLM. We see that formally the first- and second-order adjoint models differ only by second-order terms, which contain the adjoint variables.

The calculation of second-order derivatives requires the storage of the trajectories of the model, the tangent linear model, and the adjoint model.

## 3. Sensitivity analysis and second-order information

### a. General sensitivity analysis

In general a model has the following kinds of variables:

- (i) *State variable*: 𝗭, in a space 𝕃, which describes the physical properties of the medium (velocity, pressure, temperature, …); 𝗭 depends on time and space.
- (ii) *Input variable*: 𝗜, which has to be provided to the model (e.g., initial or boundary conditions); most of the time these variables are not directly measured, but they can be estimated through data assimilation.
- (iii) *Parameters*: 𝗞, the empirical parameters (e.g., diffusivity) that most models contain, which have to be tuned to adjust the model to the observations.

In many applications sensitivity analysis is carried out; for instance if we consider some scalar quantity linked to a solution of the model, what will be its variation if there is some perturbation on the inputs of the model?

*Sensitivity analysis* is defined through a so-called response function 𝗚: 𝕃 → ℝ, depending on the state variable 𝗭 (and therefore indirectly depending on 𝗜 and 𝗞). By definition the sensitivity of 𝗚 with respect to 𝗞 (respectively 𝗜) is the gradient of 𝗚 with respect to 𝗞 (respectively 𝗜).

There are two ways to estimate the sensitivity.

### b. Sensitivity analysis via finite differences

Assume 𝗞 has dimension *N.* Let *k*_{i}, 1 ≤ *i* ≤ *N,* be the components of 𝗞; then the estimation of ∂𝗚/∂*k*_{i} can be carried out by computing

∂𝗚/∂*k*_{i} ≈ [𝗚(𝗞 + *α***e**_{i}) − 𝗚(𝗞)]/*α,*

where **e**_{i} are the canonical base vectors. This procedure for estimating a sensitivity is simple. Nevertheless it can be costly if *N* is large, since it requires as many integrations of the model as there are components *N.* Furthermore, (70) gives only an approximation of the gradient. The scalar *α* has to be chosen such that the model has a linear behavior with respect to *α,* and its determination may require several integrations of the model for each value of *k*_{i}. The main advantage of this method is that it does not require significant software development.

### c. Sensitivity via the adjoint

Let **i** and **k** be perturbations on 𝗜 and 𝗞. The derivation of (69) leads to the tangent linear system, and for the response function we obtain the Gâteaux derivative **Ĝ**. The gradient is then obtained by exhibiting the linear dependence of **Ĝ** on **i** and **k**. Let us introduce **Π**, an adjoint variable of the same dimension as 𝗭; taking the inner product of the tangent linear system with **Π** and identifying the linear terms gives the sensitivities **S**_{I} and **S**_{K}. It is worth noting that the sensitivity is obtained after only one run of the adjoint model and the result is exact. The price to be paid is in software development, since an adjoint model has to be developed.
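The trade-off between the two approaches can be sketched on a toy steady linear model (the operators A, S, c below are invented stand-ins): finite differences cost one extra model solve per parameter, while a single adjoint solve gives the exact sensitivity with respect to all parameters at once:

```python
import numpy as np

rng = np.random.default_rng(3)
n, Nk = 6, 10
A = rng.standard_normal((n, n)) + n * np.eye(n)   # toy model operator
S = rng.standard_normal((n, Nk))                  # how the parameters K force the model
c = rng.standard_normal(n)                        # response functional G(Z) = c^T Z

def model(K):
    return np.linalg.solve(A, S @ K)   # state Z(K)

def G(K):
    return c @ model(K)

K = rng.standard_normal(Nk)

# Finite differences: one extra model solve per parameter (Nk solves in total)
alpha = 1e-6
sens_fd = np.array([(G(K + alpha * e) - G(K)) / alpha for e in np.eye(Nk)])

# Adjoint: a single solve of A^T Pi = c, then S_K = S^T Pi (exact)
Pi = np.linalg.solve(A.T, c)
sens_adj = S.T @ Pi
```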

### d. Sensitivity analysis and data assimilation

If **X** is the state variable, 𝗜 the input, and **P** the adjoint variable, then **X** and **P** are solutions of the optimality system (OS), *J*(**X**, 𝗜) being the cost function.

For sensitivity studies in the presence of observations, with a given response function, we have to consider the OS as a generalized model with state variable (**X**, **P**)^{T}, and the general sensitivity analysis has to be applied to this generalized model. Therefore the adjoint of the optimality system has to be derived.

Applying a perturbation **i** on 𝗜, we obtain the system (74)–(76) for the Gâteaux derivatives **X̂** and **P̂** of **X** and **P** in the direction **i**, together with the corresponding derivative of the response function 𝗚 defined earlier. After the second-order adjoint variables **Q** and **R** are introduced, we take the inner product of (74) and (75) by **Q** and of (76) by **R**; then we add the three equations. If **Q** and **R** are defined as the solution of (78), then we get the gradient of *G* with respect to 𝗜, that is, the sensitivity (79).

Therefore the algorithm to get the sensitivity is as follows:

- (i) solve the optimality system to obtain
**X**and**P** - (ii) solve the coupled system (78) to obtain
**Q**and**R**, and - (iii) compute the sensitivity by (79).

Sensitivity analysis in the presence of observations thus requires taking second-order information into account. A very simple example given by Le Dimet et al. (1997) clearly shows the necessity of introducing this term.

## 4. Kalnay et al. (2000) quasi-inverse method and second-order information

In the incremental formulation, *δ***x** is the difference between the analysis and the background at the beginning of the assimilation window, 𝗟 and 𝗟^{T} are the TLM and its adjoint, and **H** is the tangent linear version of the forward observation operator. Differentiating the cost function *J* with respect to the initial increment *δ***x** = **x**^{a} − **x**^{b}, where **x**^{a} and **x**^{b} are the analysis and first guess, respectively, we obtain

∇*J* = 𝗟^{T}**H**^{T}𝗢^{−1}(**H**𝗟​*δ***x** − *δ***y**),

where 𝗢 is the observation-error covariance matrix and *δ***y** is the innovation vector. In a gradient-based minimization the successive iterates are

*δ***x**^{i} = *δ***x**^{i−1} − *α*_{i}∇*J*^{i−1},

where *i* is the minimization counter and *α*_{i} is the step size in the minimization algorithm.

One stops after a number of minimization iterations when ‖∇*J*‖ is small enough to satisfy a convergence criterion.

In order to determine the optimal value of the step size, the minimization algorithm, say quasi-Newton, requires additional computations of the gradient ∇*J*^{i−1}, so that the number of direct and adjoint integrations required by adjoint 4DVAR can be larger than the number of minimization iterations (see Kalnay et al. 2000).

The inverse 3DVAR approach of Kalnay seeks to obtain directly the “perfect solution,” that is, the special *δ***x** that makes ∇*J* = 0, provided *δ***x** is small.

Setting ∇*J* = **0** yields

*δ***x** = (𝗟^{T}**H**^{T}𝗢^{−1}**H**𝗟)^{−1}𝗟^{T}**H**^{T}𝗢^{−1}*δ***y**;

using the invertibility of 𝗟, we obtain

*δ***x** = 𝗟^{−1}(**H**^{T}𝗢^{−1}**H**)^{−1}**H**^{T}𝗢^{−1}*δ***y**.

As shown by Kalnay et al. (2000) this is equivalent to the adjoint Newton algorithm used by Wang et al. (1997) except that it does not require a line minimization.

Wang et al. (1998) proposed an adjoint Newton algorithm that also required the backward integration of the tangent linear model, together with a reformulation of the adjoint Newton algorithm for the case when the TLM is not invertible; they did not, however, explore this idea in depth. Physical processes are generally not parameterized in a reversible form in atmospheric models, a problem that can be overcome only to some extent by using simplified reversible physics. Truly dissipative processes in atmospheric models are likewise not reversible and as such constitute a problem for the inverse 3DVAR. To show the link of inverse 3DVAR to second-order information, we follow Kalnay et al. (2000) in showing that inverse 3DVAR is equivalent to using a perfect Newton iterative method to solve the minimization problem at a given time level.

Assume that the minimum is attained at **x** + *δ***x** given that our present estimate of the solution is **x**; then expanding ∇*J*(**x** + *δ***x**) in a Taylor series to the second term yields

∇*J*(**x** + *δ***x**) ≈ ∇*J*(**x**) + ∇^{2}*J*(**x**)*δ***x** = **0**,

where ∇^{2}*J*(**x**) is the Hessian matrix. The Newton step is therefore

*δ***x** = −[∇^{2}*J*(**x**)]^{−1}∇*J*(**x**),

and for the quadratic cost function above ∇^{2}*J*(**x**) = 𝗟^{T}**H**^{T}𝗢^{−1}**H**𝗟, so that [∇^{2}*J*(**x**)]^{−1} = 𝗟^{−1}(**H**^{T}𝗢^{−1}**H**)^{−1}(𝗟^{T})^{−1}.

Since cost functions used in 4DVAR are close to quadratic functions, one may view 3DVAR as a perfect preconditioner of a simplified 4DVAR problem.

In general, availability of second-order information allows powerful minimization algorithms to be used (Wang et al. 1995, 1997) even when the inverse 3DVAR is difficult to obtain as is the case with full physics models.
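For a quadratic cost, the equivalence between one full Newton step and the closed-form solution discussed in this section can be checked directly (a toy sketch with invented invertible 𝗟, **H** and a diagonal 𝗢):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
L = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible TLM
H = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible linearized obs operator
O = np.diag(rng.uniform(0.5, 2.0, n))             # observation-error covariance
dy = rng.standard_normal(n)
Oinv = np.linalg.inv(O)

def grad(dx):
    # gradient of J(dx) = 0.5 (H L dx - dy)^T O^{-1} (H L dx - dy)
    return L.T @ H.T @ Oinv @ (H @ L @ dx - dy)

hess = L.T @ H.T @ Oinv @ H @ L

# One full Newton step from dx = 0 reaches grad J = 0 for this quadratic cost
dx_newton = -np.linalg.solve(hess, grad(np.zeros(n)))

# Equivalent closed form obtained using the invertibility of L
dx_inverse = np.linalg.solve(L, np.linalg.inv(H.T @ Oinv @ H) @ H.T @ Oinv @ dy)
```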

## 5. Hessian information in optimization theory

Hessian information is crucial in many aspects of both constrained and unconstrained minimization. All minimization methods start by assuming a quadratic model in the vicinity of the minimum of a multivariate minimization problem.

A necessary condition for **X*** to be a stationary point is that the gradient of the objective function vanishes at **X***; a sufficient condition for **X*** to be a local minimizer is that, in addition, the Hessian at **X*** is positive definite.

### a. Spectrum of the Hessian and rate of convergence of unconstrained minimization

Consider the objective function *f*(**X**) of (89) and let **X*** denote a local minimizer point, that is, a point that satisfies the condition

*f*(**X***) ≤ *f*(**X**) for all **X** such that ‖**X** − **X***‖ < *ϵ,*

where *ϵ* is typically a small positive number whose value may depend on the value of **X***. We define **X*** as an acceptable solution of (89).

If **X*** is a strong local minimum, then the Hessian 𝗚(**X***) of *f* at **X*** is positive definite, and *f* is well approximated by a quadratic model in a neighborhood of **X***. The conditioning of 𝗚(**X***) substantially affects the size of ‖**X** − **X***‖, that is, the rate of convergence of the unconstrained minimization (Gill et al. 1981).

If 𝗚(**X***) is ill-conditioned, the error in **X** will vary with the direction of the perturbation **p**.

If **p** is a linear combination of eigenvectors of 𝗚(**X***) corresponding to the largest eigenvalues, the size of ‖**X** − **X***‖ will be relatively small, while if, on the other hand, **p** is a linear combination of eigenvectors of 𝗚(**X***) corresponding to the smallest eigenvalues, the size of ‖**X** − **X***‖ will be relatively large; that is, there will be slow convergence.
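This behavior can be seen on a toy 2 × 2 quadratic (the Hessian 𝗚 and starting points below are invented for illustration): steepest descent with exact line search converges in one step when the error lies along a single eigenvector of 𝗚, but from a worst-case start it contracts only by the classical factor (κ − 1)/(κ + 1) per iteration, where κ is the condition number:

```python
import numpy as np

# Quadratic f(X) = 0.5 X^T G X with ill-conditioned Hessian G
G = np.diag([100.0, 1.0])              # condition number kappa = 100

def steepest_descent(x, iters):
    for _ in range(iters):
        g = G @ x                       # gradient of the quadratic
        step = (g @ g) / (g @ G @ g)    # exact line search step length
        x = x - step * g
    return x

# Starting along a single eigenvector: convergence in one step
x_fast = steepest_descent(np.array([1.0, 0.0]), 1)

# Worst-case start (components in the ratio 1/kappa): slow zigzag with
# per-iteration contraction (kappa - 1)/(kappa + 1) = 99/101 in each coordinate
x_slow = steepest_descent(np.array([0.01, 1.0]), 25)
```

After 25 iterations the worst-case error has shrunk only by a factor of about (99/101)^25 ≈ 0.6.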

### b. Role of the Hessian in constrained minimization

The Hessian information plays a very important role in constrained optimization as well. We shall deal here with optimality conditions, where again Taylor series approximations are used to analyze the behavior of the objective function and of the constraints *h*_{i} about a local constrained minimizer **X***.

We shall consider first optimal conditions for linear equality constraints.

Consider minimization of *f*(**X**) subject to the linear equality constraints 𝗔**X** = **b**, where 𝗔 is an *m* × *n* matrix, *m* ≤ *n.* The constrained problem can be recast as an equivalent unconstrained problem. Assume we have a feasible point **X̄**, that is, 𝗔**X̄** = **b**; then every feasible point can be written as **X** = **X̄** + **p**, where **p** is a feasible direction.

For any **p** satisfying 𝗔**p** = **0**, the objective function can be expanded about the feasible point **X̄** in a Taylor series in **p**, so that the optimality conditions involve the behavior of *f*(**X̄** + **p**) over the feasible directions **p**.

Let 𝗭 be a matrix of dimension *n* × *r,* with *r* ≥ *n* − *m,* whose columns span the null space of 𝗔. Then the feasible region is given by **X** = **X̄** + 𝗭**v**, and the constrained minimization problem in **X** is equivalent to the unconstrained problem

min_{**v**} Φ(**v**) = *f*(**X̄** + 𝗭**v**),

where **X̄** is a feasible point (Gill et al. 1981; Nash and Sofer 1996). The function Φ is the restriction of *f* to the feasible region, expressed in *n* − *m* variables. The constrained problem has thus been transformed into an unconstrained problem with a reduced number of variables.

Every feasible point may be written **X** = **X̄** + 𝗭**v**. The vector 𝗭^{T}∇*f*(**X**) is known as the reduced gradient of *f* at **X**; similarly the matrix 𝗭^{T}∇^{2}*f*(**X**)𝗭 is the reduced Hessian of *f* at **X**. If **X*** is a local solution of the constrained problem, then **v*** is the local minimizer of Φ; hence we can write ∇Φ(**v***) = 𝗭^{T}∇*f*(**X***) = **0**, and ∇^{2}Φ(**v***) = 𝗭^{T}∇^{2}*f*(**X***)𝗭 is positive semidefinite.

Thus if **X*** is a local minimizer of *f* subject to 𝗔**X** = **b**, then the reduced gradient 𝗭^{T}∇*f*(**X***) is zero and the reduced Hessian matrix 𝗭^{T}∇^{2}*f*(**X***)𝗭 is positive semidefinite. (The second-order derivative information is used to distinguish local minimizers from other stationary points.) The second-order condition is equivalent to the condition

**p**^{T}∇^{2}*f*(**X***)**p** ≥ 0 for all **p** in the null space of 𝗔.

Noting that **p** = 𝗭**v** is a null-space vector, we can rewrite (107) in this form; that is, the Hessian matrix at **X*** must be positive semidefinite on the null space of 𝗔.
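A numerical sketch of the reduced-gradient conditions (the operators 𝗔, 𝗚, **c** below are random stand-ins, not from the paper): build a null-space basis 𝗭 of 𝗔 by SVD, minimize the reduced quadratic, and verify that the reduced gradient vanishes and the reduced Hessian is positive definite at the constrained minimizer:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 6, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
Q = rng.standard_normal((n, n))
G = Q @ Q.T + n * np.eye(n)        # SPD Hessian of f(x) = 0.5 x^T G x - c^T x
c = rng.standard_normal(n)

def grad_f(x):
    return G @ x - c

# Null-space basis Z of A via SVD: columns span {p : A p = 0}
_, _, Vt = np.linalg.svd(A)
Z = Vt[m:].T                        # n x (n - m), orthonormal columns

x_bar = np.linalg.lstsq(A, b, rcond=None)[0]   # one feasible point
# Reduced (unconstrained) problem in v: minimize phi(v) = f(x_bar + Z v)
red_hess = Z.T @ G @ Z
v_star = np.linalg.solve(red_hess, -Z.T @ grad_f(x_bar))
x_star = x_bar + Z @ v_star

reduced_grad = Z.T @ grad_f(x_star)      # should vanish at the minimizer
eigs = np.linalg.eigvalsh(red_hess)      # should all be positive
```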

### c. Application of second-order-adjoint technique to obtain exact Hessian/vector product

We will exemplify this application by considering a truncated-Newton algorithm for large-scale minimization.

#### 1) Description of Truncated-Newton methods

In the truncated-Newton method, at each outer iteration the Newton equations

∇^{2}*f*(**X**_{k})**p** = −∇*f*(**X**_{k})

are solved only approximately (they are truncated) by an inner conjugate-gradient iteration. Solving this linear system by conjugate gradients requires only Hessian/vector products ∇^{2}*f*(**X**_{k})**v** for arbitrary vectors **v**. One way to bypass the storage difficulty is to approximate the Hessian matrix/vector products using values of the gradient, in such a way that the Hessian matrix need not be computed or stored. Using the Taylor series

∇*f*(**X**_{k} + *h***v**) = ∇*f*(**X**_{k}) + *h*∇^{2}*f*(**X**_{k})**v** + O(*h*^{2}),

we obtain the approximation

∇^{2}*f*(**X**_{k})**v** ≈ [∇*f*(**X**_{k} + *h***v**) − ∇*f*(**X**_{k})]/*h*

for some small value of *h.*

The task of choosing an adequate *h* is an arduous one (see Nash and Sofer 1996, chapter 11.4.1 and references therein). For in-depth descriptions of the truncated-Newton (also referred to as the Hessian-free) method, see Nash (1984a–d, 1985) and Nash and Sofer (1989a,b), as well as Schlick and Fogelson (1992a,b), and early work by Dembo et al. (1982) and Dembo and Steihaug (1983). A comparison of limited memory quasi-Newton (see Liu and Nocedal 1989) and truncated-Newton methods is provided by Nash and Nocedal (1991), while a comprehensive well-written survey of truncated-Newton methods is presented in Nash (2000). A comparison between limited memory quasi-Newton and truncated-Newton methods applied to a meteorological problem is described in depth by Zou et al. (1990, 1993).
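A compact sketch of a truncated-Newton iteration with finite-difference Hessian/vector products (illustrative only; the convex test function and all tolerances are invented, and the unit Newton step omits the line search a production code would use):

```python
import numpy as np

def tn_minimize(f_grad, x, outer_iters=20, cg_iters=20, h=1e-7):
    """Truncated Newton: approximately solve H p = -g by CG, using
    finite-difference Hessian/vector products H v ~ (g(x + h v) - g(x)) / h."""
    for _ in range(outer_iters):
        g = f_grad(x)
        if np.linalg.norm(g) < 1e-8:
            break
        hv = lambda v: (f_grad(x + h * v) - g) / h
        # conjugate-gradient inner loop on H p = -g
        p, r = np.zeros_like(x), -g.copy()
        d, rs = r.copy(), r @ r
        for _ in range(cg_iters):
            Hd = hv(d)
            alpha = rs / (d @ Hd)
            p, r = p + alpha * d, r - alpha * Hd
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-10:
                break
            d, rs = r + (rs_new / rs) * d, rs_new
        x = x + p                 # unit step (no line search, for brevity)
    return x

# Convex separable test problem with gradient 4(x - 1)^3 + x, minimizer x = 0.5
grad = lambda x: 4 * (x - 1) ** 3 + x
x_min = tn_minimize(grad, np.zeros(4))
```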

### d. A method for estimating the Hessian matrix

Consider a cost function measuring the misfit to observations at times *t*_{r}, of the form

*J* = ½ Σ_{r=0}^{N} 〈**W**(*t*_{r})[𝗕**X**(*t*_{r}) − **X**^{obs}(*t*_{r})], 𝗕**X**(*t*_{r}) − **X**^{obs}(*t*_{r})〉,

where 𝗕 is an observation operator, **X**(*t*_{r}) the vector of model control variables, **X**^{obs}(*t*_{r}) the vector of observational data at time *t* = *t*_{r}, *t* = *t*_{N} the final time of model integration, and **W**(*t*_{r}) the inverse of the observation covariance matrix.

The gradient of the cost function with respect to the initial conditions is obtained by a backward adjoint integration from *t* = *t*_{N} to *t* = *t*_{0}. At the minimum **X**_{min}, the gradient of the cost function ∇*J* vanishes.

If random perturbations ***η***(*t*_{0}) and ***η***(*t*_{N}), with zero expectations and whose covariances are the diagonal elements of 𝗪^{−1}(*t*_{0}) and 𝗪^{−1}(*t*_{N}), respectively, are added to the observations, then ∇*J* (at **X**^{*}_{min}) no longer vanishes, and

*E*[∇*J*(∇*J*)^{T}] = 𝗝″,

where *E*[·] is the mathematical expectation and 𝗝″ is the Hessian matrix. We can see that each realization yields an outer vector product expression, which is a rank-one matrix.

For each realization *i* of the perturbed observations **X**^{obs}(*t*_{0}) and **X**^{obs}(*t*_{N}) we can calculate ∇*J*^{i} at **X**_{min}, and after *p* such realizations we obtain an at most rank-*p* approximation of the Hessian of the cost function (Yang et al. 1996; Rabier and Courtier 1992; Courtier et al. 1994):

𝗝^{″}_{p} = (1/*p*) Σ_{i=1}^{p} ∇*J*^{i}(∇*J*^{i})^{T}.

This approach is analogous to quasi-Newton methods, where symmetric rank-one or rank-two updates are collected to update approximations of the Hessian or of the inverse of the Hessian matrix as the minimization proceeds. As shown by Yang et al. (1996), use of the approximate 𝗝^{″}_{p} provides valuable Hessian information at a modest cost.

## 6. Second-order adjoint via automatic differentiation

There is an increased interest in obtaining the second-order adjoint via automatic differentiation (AD).

The recent version of the FORTRAN TAMC AD package designed by Giering and Kaminski (1998a) allows both the calculation of Hessian/vector products and the more computationally expensive derivation of the full Hessian with respect to the control variables. CPU times comparable to those required by hand coding were reported (Giering and Kaminski 1998b).

Hessian/vector products derived by AD are particularly important in minimization, where there is often interest not only in the first but also in the second derivatives of the cost functional, which convey crucial information.

Griewank (2000) thoroughly estimated the computational complexity of implementing second-order adjoints. He found that calculating Hessian/vector products required an effort leading to a run-time ratio (relative to the underlying function evaluation) of about a factor of 13, while the ratio between the effort required to obtain a Hessian/vector product and that required to calculate the gradient of the cost was found to be only a factor of 2 to 3.

Assume the function values of *f* are evaluated as a sequence of assignments

*υ*_{i} = *ϕ*_{i}(*υ*_{j})_{j≺i},

where the variables *υ*_{i} are real scalars and the elemental functions *ϕ*_{i} are either binary arithmetic operations or univariate intrinsics.
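The propagation of second-order information through such a sequence of elemental assignments can be sketched with a small forward-mode class that carries (value, first, second) directional Taylor coefficients; it yields the directional derivatives **v** · ∇*f* and **v**^{T}∇^{2}*f*​**v** (a didactic sketch only; obtaining the full Hessian/vector product ∇^{2}*f*​**v** requires the forward-over-reverse combination discussed in the text):

```python
import math

class Taylor2:
    """Propagate (value, first, second) directional derivatives through
    elemental operations, i.e. evaluate g(t) = f(x + t v) and g', g'' at t = 0."""
    def __init__(self, val, d1=0.0, d2=0.0):
        self.val, self.d1, self.d2 = val, d1, d2
    def __add__(self, o):
        o = o if isinstance(o, Taylor2) else Taylor2(o)
        return Taylor2(self.val + o.val, self.d1 + o.d1, self.d2 + o.d2)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Taylor2) else Taylor2(o)
        return Taylor2(self.val * o.val,
                       self.d1 * o.val + self.val * o.d1,
                       self.d2 * o.val + 2 * self.d1 * o.d1 + self.val * o.d2)
    __rmul__ = __mul__

def sin(u):
    # univariate intrinsic with its first and second derivative rules
    return Taylor2(math.sin(u.val),
                   math.cos(u.val) * u.d1,
                   -math.sin(u.val) * u.d1 ** 2 + math.cos(u.val) * u.d2)

# f(x0, x1) = x0^2 * x1 + sin(x0), evaluated along the direction v = (v0, v1)
def f(x0, x1):
    return x0 * x1 * x0 + sin(x0)

x, v = (1.2, 0.7), (0.3, -0.5)
seeds = [Taylor2(x[i], v[i], 0.0) for i in range(2)]
out = f(*seeds)   # out.d1 = v . grad f,  out.d2 = v^T Hessian(f) v
```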

The first *n* variables *υ*_{j−n} = *x*_{j}, *j* = 1, … , *n,* are the independent variables, and the last *m* variables *y*_{i} = *υ*_{l−m+i}, *i* = 1, … , *m,* are the dependent variables. Collecting the elemental partials *c*_{ij} = ∂*ϕ*_{i}/∂*υ*_{j}, which are needed for evaluating the Jacobian *f*′(*x*), yields a sparse lower-triangular matrix 𝗖. This 𝗖 can also be interpreted as a linearized computational graph whose vertices are the elemental functions *ϕ*_{i} and whose edges are the nonvanishing partials *c*_{ij}.

The goal is to compute the Hessian ∇^{2}*f* ∈ ℝ^{n×n}.

The derivation originates in the approach put forward by Griewank (2001) of Jacobian accumulation procedures using implicit function theorem.

Griewank (2000) derives a class of derivative accumulation procedures as edge eliminations on the linearized computational graph.

The sequence of assignments may be written as a system of equations *E*(*x,* *υ*) = 0. Applying the implicit function theorem one obtains the derivative ∂*υ*/∂*x.* The (*n* + *i*)th row of ∂*υ*/∂*x* represents exactly the gradient

**V̇**_{i} = ∇_{x}*υ*_{i}

of the *i*th intermediate value *υ*_{i} with respect to all independent variables. Stacking these rows gives

**V̇** = (**V̇**^{T}_{i})_{i=1−n}^{l−m} ∈ ℝ^{(l−m+n)×n},

where each row **V̇**^{T}_{i} = ∇_{x}*υ*_{i} is the gradient of the intermediate *υ*_{i} with respect to the independents *x* ∈ ℝ^{n}, computed during the forward sweep used to calculate the gradient *f*_{x}. The Hessian of *f* may then be accumulated as

∇^{2}*f* = **V̇**^{T}**Z̈V̇** ∈ ℝ^{n×n},

where the matrix **Z̈** ∈ ℝ^{(l−m+n)×(l−m+n)} is zero except for diagonal elements of the form *ῡ*_{j}*ϕ*″_{j} and diagonal blocks of the form *ῡ*_{j}∇^{2}*ϕ*_{j} (*ϕ*_{j} being a nonlinear elemental). One can show that first-order derivatives form the nonzero elements of the matrix **Ż** of stacked gradients **V̇**^{T}_{i} = ∇_{x}*υ*_{i} of the intermediates *υ*_{i} with respect to the independents *x* ∈ ℝ^{n}.
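The resulting "tangent-of-adjoint" construction can be sketched by hand for the small function *f*(*x*₁, *x*₂) = *x*₂ sin *x*₁ (a sketch under our own naming, not generated code): the reverse sweep yields the gradient, and differentiating every statement of that sweep along a direction **u** yields the exact Hessian/vector product (∇²*f*)**u**.

```python
import math

# Toy function f(x1, x2) = x2 * sin(x1); names and structure are ours.

def grad_reverse(x1, x2):
    # Forward sweep
    v1 = math.sin(x1)
    # Reverse sweep (first-order adjoint), seeded with ybar = df/dy = 1
    v1bar = x2                     # adjoint of the intermediate v1
    x2bar = v1                     # df/dx2
    x1bar = math.cos(x1) * v1bar   # df/dx1
    return x1bar, x2bar

def soa_hess_vec(x1, x2, u1, u2):
    # Tangent (forward) mode applied to the sweeps above, in direction u:
    # every variable acquires a directional derivative d(.) along u.
    v1 = math.sin(x1)
    dv1 = math.cos(x1) * u1
    v1bar, dv1bar = x2, u2
    dx1bar = -math.sin(x1) * u1 * v1bar + math.cos(x1) * dv1bar
    dx2bar = dv1
    return dx1bar, dx2bar          # the product (grad^2 f) u
```

Differentiating the reverse sweep statement by statement is precisely why the run-time of a second-order adjoint stays within a small constant factor of the gradient computation.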

## 7. Use of Hessian of cost functional to estimate error covariance matrices

A relationship exists between the inverse Hessian matrix and the analysis error covariance matrix of either 3DVAR or 4DVAR (see Thacker 1989; Thepaut and Courtier 1991; Rabier and Courtier 1992; Yang et al. 1996; Le Dimet et al. 1997).

If **x**_{b} is the unbiased background field and **y** the set of unbiased observations, both being realizations of random variables with covariances 𝗕 and 𝗢, respectively, and if the operator 𝗛 computes the model equivalent of the observations **y** from the state **x**, then the Hessian 𝗝″ of the cost function *J* at the minimum is given by

𝗝″ = 𝗕^{−1} + 𝗛^{T}𝗢^{−1}𝗛.

Denoting by **x**_{a} the result of the minimization (i.e., the analysis) and by **x**_{t} the truth, one sees that the error covariance at the minimum is

⟨(**x**_{a} − **x**_{t})(**x**_{a} − **x**_{t})^{T}⟩ = (𝗝″)^{−1}.

A requirement is that the background error and the observation error be uncorrelated (Rabier and Courtier 1992; Fisher and Courtier 1995). See also the work of Thepaut and Moll (1990), who point out that the diagonal of the Hessian is optimal among all diagonal preconditioners.
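For a linear observation operator this relationship is easy to verify numerically. The following sketch (with small random matrices standing in for 𝗕, 𝗢, and 𝗛) checks that the inverse Hessian coincides with the analysis error covariance obtained from the usual Kalman gain formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6
H = rng.standard_normal((m, n))           # toy linear observation operator
B = np.diag(rng.uniform(0.5, 2.0, n))     # background error covariance
O = np.diag(rng.uniform(0.5, 2.0, m))     # observation error covariance

# Hessian of the cost function at the minimum: J'' = B^-1 + H^T O^-1 H
Jpp = np.linalg.inv(B) + H.T @ np.linalg.inv(O) @ H

# Its inverse is the analysis error covariance ...
A = np.linalg.inv(Jpp)

# ... identical (by the Sherman-Morrison-Woodbury identity) to (I - KH)B,
# where K is the Kalman gain
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + O)
print(np.allclose(A, (np.eye(n) - K @ H) @ B))   # True
```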

## 8. Hessian singular vectors

The computation of Hessian singular vectors (HSVs) uses the full Hessian of the variational data assimilation cost function, which can be viewed as an approximation of the inverse of the analysis error covariance matrix; this Hessian is used at initial time to define a norm, while the total energy norm is still used at optimization time (see Barkmeijer et al. 1998, 1999). The HSVs are consistent with the 3DVAR estimates of the analysis error statistics, and they can also be defined in the context of 4DVAR. In practice one never knows the full 3DVAR Hessian in its matrix form, and a generalized eigenvalue problem is solved, as will be described below.

The HSVs are also used in a method first proposed by Courtier (1993) and tested by Rabier et al. (1997) for the development of a simplified Kalman filter fully described by Fisher (1998) and compared with a low-resolution explicit extended Kalman filter by Ehrendorfer and Bouttier (1998).

Let 𝗠 be the propagator of the tangent linear model, and 𝗣 a projection operator setting a vector to zero outside a given domain.

Consider positive-definite and symmetric operators 𝗖 and 𝗘 defining a norm at initial and at optimization time, respectively. The singular vectors *x* are then the solutions of the generalized eigenvalue problem

𝗠^{T}𝗣^{T}𝗘𝗣𝗠*x* = *λ*𝗖*x.*

In HSVs the operator 𝗖 is equal to the Hessian of the 3D-/4DVAR cost function.
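Such a generalized eigenvalue problem, with 𝗖 symmetric positive definite, can be sketched numerically; all the operators below are small random stand-ins for the propagator, projection, energy norm, and Hessian, chosen only to illustrate the linear algebra:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))        # toy tangent linear propagator
P = np.diag([1.0, 1.0, 1.0, 0.0, 0.0]) # projection onto a subdomain
E = np.eye(n)                          # total-energy norm at optimization time
Q = rng.standard_normal((n, n))
C = Q @ Q.T + n * np.eye(n)            # SPD stand-in for the 3DVAR Hessian

A = M.T @ P.T @ E @ P @ M              # left-hand operator (symmetric)
lam, X = eigh(A, C)                    # solves A x = lam C x, lam ascending
# The leading HSV is the eigenvector associated with the largest lam.
```

In an operational setting neither A nor C is available in matrix form; only operator/vector products are, which is why Lanczos-type iterative solvers are used instead of a dense factorization.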

In its matrix form this Hessian reads

∇^{2}*J* = 𝗕^{−1} + 𝗛^{T}𝗢^{−1}𝗛,

so that the inner product at initial time is ⟨*x,* (∇^{2}*J*)*x*⟩; the norm is thus defined by an approximation of the inverse of the analysis error covariance matrix.

The simplified Kalman filter requires pairs of vectors **z**_{k} and **s**_{k} satisfying

**z**_{k} = (𝗣^{f})^{−1}**s**_{k},

where 𝗣^{f} is a flow-dependent approximation to the covariance matrix of the background error. Such pairs of vectors can be calculated during the course of the Hessian singular vector calculation (see Fisher 1998).

In this calculation, in which the inner product at the initial time is defined by the Hessian matrix of an analysis cost function, the vectors **s**_{k} are partially evolved singular vectors, while the vectors **z**_{k} are produced during the adjoint model integration.

Veerse (1999) proposes to take advantage of this form of the appropriate Hessian in order to obtain approximations of the inverse analysis error covariance matrix, using the limited memory inverse Broyden, Fletcher, Goldfarb, and Shanno (BFGS) minimization algorithm.

Denoting by 𝗛 an approximation of the inverse Hessian (∇^{2}*J*)^{−1} and by 𝗛^{+} the updated version of the inverse Hessian, the BFGS update uses the pair of vectors **s** = **x**^{n+1} − **x**^{n} and **y** = **g**^{n+1} − **g**^{n}, where **s** is the difference between the new iterate and the previous one in a limited-memory quasi-Newton minimization procedure and **y** is the corresponding difference of gradients.

Many minimization methods are implemented by using the inverse Hessian matrix/vector product that is built into the minimization code, such as Nocedal's algorithm (Nocedal 1980). These methods are useful when the second-order adjoint is not available due to either memory or CPU limitations.
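The underlying inverse-BFGS update can be sketched generically (this is the textbook formula, not the code of Veerse 1999): given the pair **s**, **y**, the updated matrix 𝗛^{+} satisfies the secant condition 𝗛^{+}**y** = **s**.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    # One inverse-BFGS update: given H ~ (grad^2 J)^{-1}, the pair (s, y)
    # produces H+ satisfying the secant condition H+ y = s.
    s = s.reshape(-1, 1)
    y = y.reshape(-1, 1)
    rho = 1.0 / float(y.T @ s)
    I = np.eye(H.shape[0])
    V = I - rho * (s @ y.T)
    return V @ H @ V.T + rho * (s @ s.T)

# With s = x^{n+1} - x^n and y = g^{n+1} - g^n collected during the
# minimization, repeated updates build up an approximation of the inverse
# Hessian, i.e., of the analysis error covariance matrix.
s = np.array([1.0, 2.0, 0.0, -1.0])   # invented iterate difference
y = np.array([0.5, 1.0, 1.0, 0.0])    # invented gradient difference
Hplus = bfgs_inverse_update(np.eye(4), s, y)
```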

## 9. Numerical experiments: Application of AD Hessian/vector products to the truncated Newton algorithm

For the numerical experiments we consider the truncated-Newton algorithm to minimize the cost function (59) associated with the SWE model (56)–(58). The spatial domain considered is a 6000 km × 4400 km channel with a uniform 21 × 21 spatial grid, such that the dimension of the initial condition vector (*u,* *υ,* *ϕ*)^{T} is 1083, and the Hessian of the cost function is a 1083 × 1083 matrix.

The initial conditions are those of Grammeltvedt (1969). As for the boundary conditions, the normal velocity components are set to zero on the southern and northern boundaries, while periodic boundary conditions are assumed in the west–east direction. Integration is performed with a time increment Δ*t* = 600 s, and the length of the assimilation window is 10 h. Data assimilation is implemented in a twin-experiment framework, such that the value of the cost function at the minimum point must be zero. The control parameters are the initial conditions, which are perturbed with random values drawn from a uniform distribution.

The second-order adjoint model was generated using TAMC (Giering and Kaminski 1998a). The correctness of the adjoint-generated routines was checked using the small-perturbations technique. Assuming that the cost function *J*(**X**) is evaluated by the subroutine model (*J,* **X**), computation of the Hessian/vector products 𝗚(**X**)**u** via automatic differentiation is performed in two steps. First, the reverse (adjoint) mode is applied to generate the adjoint model. Next, the tangent (forward) mode is applied to the adjoint model to generate the second-order adjoint (SOA) model. The performance of the minimization process using the AD SOA is analyzed against an approximate Hessian/vector product computation given by (116), with a hand-coded adjoint model implementation. The absolute and relative differences between the Hessian/vector products computed by the two methods at the first iteration (initial guess state) are shown in Fig. 1 for the first 100 components. The first-order finite-difference method (FD) provides on average an accuracy of two to three significant digits. The optimization process using FD stops after 28 iterations, when the line search fails to find an acceptable step size along the search direction, whereas for the SOA method a relative reduction in the cost function down to machine precision is reached at iteration 29. The evolution of the normalized cost function and gradient norm are presented in Figs. 2 and 3, respectively.

The computational cost is of the same order of magnitude for both the finite-difference approach and the exact second-order adjoint approach. The second-order adjoint approach requires integrating the original nonlinear model and its TLM forward in time, and integrating the first-order adjoint model and the second-order adjoint model backward in time, once each. The average ratio of the CPU time required to compute the gradient of the cost function to the CPU time used in evaluating the cost function was CPU(∇*J*)/CPU(*J*) ≈ 3.7. If we assume that the value of the gradient ∇*J*(**X**) in (116) is already available (previously computed in the minimization algorithm), only one additional gradient evaluation ∇*J*(**X** + *h***u**) is needed in (116) to evaluate the Hessian/vector product using the FD method. In this case we then have an average ratio to compute the Hessian/vector product of CPU(𝗚**u**)_{FD}/CPU(*J*) ≈ 3.7. Using the SOA method to compute the exact Hessian/vector product, we obtained an average CPU(𝗚**u**)_{SOA}/CPU(*J*) ≈ 9.4, in agreement with the estimate (A4) in the appendix. We note that, in addition to the Hessian/vector product, the AD SOA implementation also provides the value of the gradient of the cost function. The average ratio CPU(𝗚**u**)_{SOA}/CPU(∇*J*) ≈ 2.5 we obtained is also in agreement with the CPU estimate (A2) in the appendix.
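The two ways of forming 𝗚(**X**)**u** can be contrasted on a toy cost function (an invented stand-in, not the SWE cost): the FD formula costs one extra gradient evaluation, while the exact product is what a second-order adjoint delivers.

```python
import numpy as np

def grad_J(x):
    # Gradient of the invented cost J(x) = sum(x**4 + x**2)
    return 4.0 * x**3 + 2.0 * x

def soa_hess_vec(x, u):
    # Exact Hessian/vector product, as a second-order adjoint would give it
    return (12.0 * x**2 + 2.0) * u

x = np.array([0.3, -0.7, 1.1])   # invented state
u = np.array([1.0, 0.5, -2.0])   # direction
h = 1.0e-6

# FD formula: only one extra gradient call if grad_J(x) is already available
g_fd = (grad_J(x + h * u) - grad_J(x)) / h
g_soa = soa_hess_vec(x, u)
rel_err = np.abs(g_fd - g_soa) / np.abs(g_soa)   # O(h) truncation error
```

The one-sided difference carries an O(*h*) truncation error in tension with round-off, which is the source of the limited number of significant digits observed for the FD method above.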

### a. Numerical calculation of Hessian eigenvalues

The computed Ritz values and the relative residuals are included in Table 1 for the Hessian evaluated at the initial guess point, and in Table 2 for the Hessian evaluated at the optimal point **X***. For our test example the eigenvalues of the Hessian are positive, such that the Hessian is positive definite and the existence of a minimum point is assured. The condition number of the Hessian is of order *k*(𝗚) ∼ 10^{4}, which explains the slow convergence of the minimization process.
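Extreme Ritz values can be obtained matrix-free from Hessian/vector products alone. A sketch with SciPy's Lanczos-type solver, using a diagonal operator with condition number 10⁴ as a stand-in for the real Hessian/vector product:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 200
d = np.linspace(1.0, 1.0e4, n)   # synthetic spectrum, condition number 1e4

def hess_vec(u):
    # Stand-in for a second-order-adjoint Hessian/vector product G u
    return d * u

G = LinearOperator((n, n), matvec=hess_vec, dtype=np.float64)
lam_max = eigsh(G, k=1, which='LA', return_eigenvectors=False)[0]
lam_min = eigsh(G, k=1, which='SA', return_eigenvectors=False)[0]
cond = lam_max / lam_min         # estimated condition number of the Hessian
```

Only matrix/vector products are required, so exactly the same code applies when `hess_vec` wraps a second-order adjoint integration instead of a stored matrix.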

Use of the eigenvalue information of the Hessian of the cost function in the regularization of ill-posed problems was illustrated by Alekseev and Navon (2001, 2002). The application consisted of a wavelet regularization approach for an ill-posed problem of adjoint parameter estimation: estimating inflow parameters from downflow data in an inverse convection case governed by the two-dimensional parabolized Navier–Stokes equations. The wavelet method provided a decomposition into two subspaces, identifying a well-posed and an ill-posed subspace, the scale of which was determined by finding the minimal eigenvalues of the Hessian of a cost functional measuring the lack of fit between model predictions and observed parameters. The control space is transformed into a wavelet space. The Hessian of the cost was obtained either by a discrete differentiation of the gradients of the cost derived from the first-order adjoint or by using the full second-order adjoint. The minimum eigenvalues of the Hessian are obtained either by employing a shifted iteration method following Zou et al. (1992) or by using the Rayleigh quotient. The numerical results obtained illustrated the usefulness and applicability of this algorithm when the minimal Hessian eigenvalue is greater than or equal to the square of the data error dispersion, in which case the problem can be considered to be well posed (i.e., regularized). If the regularization fails, that is, if the minimal Hessian eigenvalue is less than the square of the data error dispersion, the next wavelet scale should be neglected and another iteration of the algorithm performed.
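The regularization test itself is simple: compare the minimal Hessian eigenvalue on a given scale, obtained for instance by inverse iteration with a Rayleigh quotient, against the square of the data error dispersion. A sketch with invented numbers:

```python
import numpy as np

def min_eig_rayleigh(G, iters=100):
    # Inverse iteration plus Rayleigh quotient: converges to the smallest
    # eigenvalue of the symmetric positive-definite matrix G.
    v = np.ones(G.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.linalg.solve(G, v)
        v /= np.linalg.norm(v)
    return float(v @ G @ v)

G = np.diag([0.04, 2.0, 5.0, 30.0])   # invented Hessian on one wavelet scale
sigma = 0.1                           # invented data error dispersion
lam_min = min_eig_rayleigh(G)
well_posed = lam_min >= sigma**2      # keep this scale if True, else drop it
```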

## 10. Summary and conclusions

The recent development of variational methods in operational meteorological centers (European Centre for Medium-Range Weather Forecasts, Météo-France) has demonstrated the strong potential of these methods.

Variational techniques require the development of powerful tools such as the adjoint model, which are useful for the adjustment of the inputs of the model (initial and/or boundary conditions). From the mathematical point of view the first-order adjoint will provide only necessary conditions for an optimal solution. The second-order analysis goes one step further and provides information that is essential for many applications:

- (i) Sensitivity analysis should be derived from a second-order analysis, that is, from the differentiation of the optimality system. This is made especially clear when sensitivity with respect to observations is required: in the analysis, observations appear only as a forcing term in the adjoint model; therefore, in order to estimate the impact of observations, it is the optimality system that should be differentiated.
- (ii) Second-order information will improve the convergence of the optimization methods, which are the basic algorithmic components of variational analysis.
- (iii) The second-order system permits estimating the covariances of the fields. This information is essential for the estimation of the impact of errors on the prediction.

A practical difficulty comes from the large dimension of the control variable; let *n* be this dimension. The second-order information will be represented by an *n* × *n* matrix. For operational models the computation of the full Hessian matrix is prohibitive; nevertheless, it is possible to extract the most useful information (eigenvalues and eigenvectors, spectrum, condition number, etc.) at a reasonable computational cost.

The numerical results obtained illustrate the ease with which present-day automatic differentiation packages allow one to obtain second-order adjoint models as well as Hessian/vector products. They also confirm numerically the CPU estimates for computational complexity as derived in section 7 (see also Griewank 2000).

Numerical calculation of the leading eigenvalues of the Hessian, along with its smallest eigenvalues, yields results similar to those obtained by Wang et al. (1998) and allows for valuable insight into the Hessian spectrum, thus allowing us to deduce important information related to the condition number of the Hessian and, hence, to the expected rate of convergence of minimization algorithms.

With the advent of ever more powerful computers, the use of second-order information in data assimilation will be within realistic reach for 3D models and is expected to become more prevalent.

The purpose of this paper has been to demonstrate the importance of new developments in second-order analysis: many directions of research remain open in this domain.

The authors wish to thank two anonymous reviewers whose constructive comments led to a marked improvement in the presentation of the present paper.

The second author would like to acknowledge support from the NSF through Grant ATM-9731472, managed by Dr. Pamela Stephens, whom we would like to thank for her support.

IDOPT is a joint project of CNRS, INRIA, Université Joseph Fourier, and INPG.

## REFERENCES

Abate, J., C. Bischof, A. Carle, and L. Roh, 1997: Algorithms and design for a second-order automatic differentiation module. *Proc. Int. Symp. on Symbolic and Algebraic Computing (ISSAC) '97,* Maui, HI, Association of Computing Machinery, 149–155.

Alekseev, K. A., and I. M. Navon, 2001: The analysis of an ill-posed problem using multiscale resolution and second order adjoint techniques. *Comput. Methods Appl. Mech. Eng.,* **190,** 1937–1953.

Alekseev, K. A., and I. M. Navon, 2002: On estimation of temperature uncertainty using the second order adjoint problem. *Int. J. Comput. Fluid Dyn.,* in press.

Arian, E., and S. Ta'asan, 1999: Analysis of the Hessian for aerodynamic optimization: Inviscid flow. *Comput. Fluids,* **28,** 853–877.

Averbukh, V. Z., S. Figueroa, and T. Schlick, 1994: Remark on Algorithm-566. *ACM Trans. Math. Software,* **20,** 282–285.

Barkmeijer, J., M. van Gijzen, and F. Bouttier, 1998: Singular vectors and estimates of the analysis-error covariance metric. *Quart. J. Roy. Meteor. Soc.,* **124A,** 1695–1713.

Barkmeijer, J., R. Buizza, and T. N. Palmer, 1999: 3D-Var Hessian singular vectors and their potential use in the ECMWF Ensemble Prediction System. *Quart. J. Roy. Meteor. Soc.,* **125B,** 2333–2351.

Bischof, C. H., 1995: Automatic differentiation, tangent linear models, and (pseudo) adjoints. *Proceedings of the Workshop on High-Performance Computing in the Geosciences,* F.-X. Le Dimet, Ed., NATO Advanced Science Institutes Series C: Mathematical and Physical Sciences, Vol. 462, Kluwer Academic, 59–80.

Burger, J., J. L. Brizaut, and M. Pogu, 1992: Comparison of two methods for the calculation of the gradient and of the Hessian of the cost functions associated with differential systems. *Math. Comput. Simul.,* **34,** 551–562.

Coleman, T. F., and J. J. More, 1984: Estimation of sparse Hessian matrices and graph-coloring problems. *Math. Program.,* **28,** 243–270.

Coleman, T. F., and J. Y. Cai, 1986: The cyclic coloring problem and estimation of sparse Hessian matrices. *SIAM J. Algebra Discrete Math.,* **7,** 221–235.

Coleman, T. F., and J. Y. Cai, 1985a: Fortran subroutines for estimating sparse Hessian matrices. *ACM Trans. Math. Software,* **11** (4), 378.

Coleman, T. F., B. S. Garbow, and J. J. More, 1985b: Software for estimating sparse Hessian matrices. *ACM Trans. Math. Software,* **11** (4), 363–377.

Courtier, P., 1993: Introduction to numerical weather prediction data assimilation methods. *Proc. ECMWF Seminar on Developments in the Use of Satellite Data in Numerical Weather Prediction,* Reading, United Kingdom, ECMWF, 189–209.

Courtier, P., J-N. Thepaut, and A. Hollingsworth, 1994: A strategy for operational implementation of 4D-Var, using an incremental approach. *Quart. J. Roy. Meteor. Soc.,* **120,** 1367–1388.

Davidon, W. C., 1991: Variable metric method for minimization. *SIAM J. Optim.,* **1,** 1–17.

Davidson, E. R., 1975: The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real symmetric matrices. *J. Comput. Phys.,* **17,** 87–94.

Dembo, R. S., and T. Steihaug, 1983: Truncated-Newton algorithms for large-scale unconstrained optimization. *Math. Program.,* **26,** 190–212.

Dembo, R. S., S. C. Eisenstat, and T. Steihaug, 1982: Inexact Newton methods. *SIAM J. Numer. Anal.,* **19,** 400–408.

Dixon, L. C. W., 1991: Use of automatic differentiation for calculating Hessians and Newton steps. *Automatic Differentiation of Algorithms: Theory, Implementation, and Application,* A. Griewank and G. F. Corliss, Eds., SIAM, 115–125.

Ehrendorfer, M., and F. Bouttier, 1998: An explicit low-resolution extended Kalman filter: Implementation and preliminary experimentation. ECMWF Tech. Memo. 259, 27 pp.

Fisher, M., 1998: Development of a simplified Kalman filter. ECMWF Tech. Memo. 260, 16 pp.

Fisher, M., and P. Courtier, 1995: Estimating the covariance matrices of analysis and forecast errors in variational data assimilation. ECMWF Tech. Memo. 220, 28 pp.

Forsythe, G. E., and E. G. Strauss, 1955: On best conditioned matrices. *Proc. Amer. Math. Soc.,* **6,** 340–345.

Gauthier, P., 1992: Chaos and quadri-dimensional data assimilation: A study based on the Lorenz model. *Tellus,* **44A,** 2–17.

Gay, D. M., 1996: More AD of nonlinear AMPL models: Computing Hessian information and exploiting partial separability. *Computational Differentiation: Techniques, Applications, and Tools,* M. Berz et al., Eds., Proceedings in Applied Mathematics, Vol. 89, SIAM, 173–184.

Giering, R., and T. Kaminski, 1998a: Recipes for adjoint code construction. *ACM Trans. Math. Software,* **24** (4), 437–474.

Giering, R., and T. Kaminski, 1998b: Using TAMC to generate efficient adjoint code: Comparison of automatically generated code for evaluation of first and second order derivatives to hand written code from the Minpack-2 collection. *Automatic Differentiation for Adjoint Code Generation,* C. Faure, Ed., INRIA Research Rep. 3555, 31–37.

Gilbert, J. C., 1992: Automatic differentiation and iterative processes. *Optim. Methods Software,* **1,** 13–21.

Gill, P. E., and W. Murray, 1979: Newton-type methods for unconstrained and linearly constrained optimization. *Math. Program.,* **28,** 311–350.

Gill, P. E., W. Murray, and M. H. Wright, 1981: *Practical Optimization.* Academic Press, 401 pp.

Grammeltvedt, A., 1969: A survey of finite-difference schemes for the primitive equations for a barotropic fluid. *Mon. Wea. Rev.,* **97,** 387–404.

Griewank, A., 1993: Some bounds on the complexity of gradients, Jacobians, and Hessians. *Complexity in Nonlinear Optimization,* P. M. Pardalos, Ed., World Scientific, 128–161.

Griewank, A., 2000: *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation.* Frontiers in Applied Mathematics, Vol. 19, SIAM, 369 pp.

Griewank, A., 2001: Complexity of gradients, Jacobians and Hessians. *Encyclopaedia of Optimization,* C. A. Floudas and P. M. Pardalos, Eds., Vol. 1, Kluwer Academic, 290–300.

Griewank, A., and G. F. Corliss, 1991: *Automatic Differentiation of Algorithms: Theory, Implementation, and Application.* SIAM, 353 pp.

Hou, G. J-W., and J. Sheen, 1993: Numerical methods for second order shape sensitivity analysis with application to heat conduction problems. *Int. J. Numer. Methods Eng.,* **36,** 417–435.

Hovland, P., 1995: Using ADIFOR 1.0 to compute Hessians. Center for Research on Parallel Computation Tech. Rep. CRPC-TR95540-S, Rice University, Houston, TX, 12 pp.

Ide, K., P. Courtier, M. Ghil, and A. Lorenc, 1997: Unified notation for data assimilation: Operational, sequential and variational. *J. Meteor. Soc. Japan,* **75B,** 71–79.

Jackson, R. H. F., and G. P. McCormic, 1988: Second order sensitivity analysis in factorable programming: Theory and applications. *Math. Program.,* **41,** 1–28.

Kalnay, E., S. K. Park, Z-X. Pu, and J. Gao, 2000: Application of the quasi-inverse method to data assimilation. *Mon. Wea. Rev.,* **128,** 864–875.

Le Dimet, F. X., and I. Charpentier, 1998: Méthodes de second ordre en assimilation de données. *Equations aux Dérivées Partielles et Applications (Articles Dédiés à Jacques-Louis Lions),* Gauthier-Villars, 623–640.

Le Dimet, F. X., H. E. Ngodock, B. Luong, and J. Verron, 1997: Sensitivity analysis in variational data assimilation. *J. Meteor. Soc. Japan,* **75B,** 245–255.

Lehoucq, R. B., D. C. Sorensen, and C. Yang, 1998: *ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods.* Software, Environments, and Tools, Vol. 6, SIAM, 160 pp.

Liu, D. C., and J. Nocedal, 1989: On the limited memory BFGS method for large scale minimization. *Math. Program.,* **45,** 503–528.

Moré, J. J., B. S. Garbow, and K. E. Hillstrom, 1981: Testing unconstrained optimization software. *ACM Trans. Math. Software,* **7,** 17–41.

Nash, S. G., 1984a: Newton-type minimization via the Lanczos method. *SIAM J. Numer. Anal.,* **21,** 770–788.

Nash, S. G., 1984b: Truncated-Newton methods for large-scale function minimization. *Applications of Nonlinear Programming to Optimization and Control,* H. E. Rauch, Ed., Pergamon Press, 91–100.

Nash, S. G., 1984c: User's guide for TN/TNBC: Fortran routines for nonlinear optimization. Mathematical Sciences Dept. Tech. Rep. 307, The Johns Hopkins University, 17 pp.

Nash, S. G., 1984d: Solving nonlinear programming problems using truncated-Newton techniques. *Numerical Optimization,* P. T. Boggs, R. H. Byrd, and R. B. Schnabel, Eds., SIAM, 119–136.

Nash, S. G., 1985: Preconditioning of truncated-Newton methods. *SIAM J. Sci. Stat. Comput.,* **6,** 599–616.

Nash, S. G., 2000: A survey of truncated-Newton methods. *J. Comput. Appl. Math.,* **124,** 45–59.

Nash, S. G., and A. Sofer, 1989a: Block truncated-Newton methods for parallel optimization. *Math. Program.,* **45,** 529–546.

Nash, S. G., and A. Sofer, 1989b: A parallel line search for Newton type methods in computer science and statistics. *Proc. 21st Symp. on the Interface: Computing Science and Statistics,* Orlando, FL, American Statistical Association, 134–137.

Nash, S. G., and J. Nocedal, 1991: A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. *SIAM J. Optim.,* **1,** 358–372.

Nash, S. G., and A. Sofer, 1996: *Linear and Nonlinear Programming.* McGraw-Hill, 692 pp.

Ngodock, H. E., 1996: Data assimilation and sensitivity analysis. Ph.D. thesis, University Joseph Fourier, Grenoble, France, 213 pp.

Nocedal, J., 1980: Updating quasi-Newton matrices with limited storage. *Math. Comput.,* **35,** 773–782.

Nocedal, J., and S. J. Wright, 1999: *Numerical Optimization.* Springer-Verlag Series in Operations Research, 656 pp.

O'Leary, D. P., 1983: A discrete Newton algorithm for minimizing a function of many variables. *Math. Program.,* **23,** 20–23.

Powell, M. J. D., and P. L. Toint, 1979: Estimation of sparse Hessian matrices. *SIAM J. Numer. Anal.,* **16,** 1060–1074.

Pu, Z. X., and E. Kalnay, 1999: Targeting observations with the quasi-inverse linear and adjoint NCEP global models: Performance during FASTEX. *Quart. J. Roy. Meteor. Soc.,* **125,** 3329–3337.

Pu, Z. X., and Coauthors, 1997: Sensitivity of forecast errors to initial conditions with a quasi-inverse linear method. *Mon. Wea. Rev.,* **125,** 2479–2503.

Rabier, F., and P. Courtier, 1992: Four-dimensional assimilation in the presence of baroclinic instability. *Quart. J. Roy. Meteor. Soc.,* **118,** 649–672.

Rabier, F., and Coauthors, 1997: Recent experimentation on 4D-Var and first results from a simplified Kalman filter. ECMWF Tech. Memo. 240, 42 pp.

Reuther, J. J., 1996: Aerodynamic shape optimization using control theory. Ph.D. dissertation, University of California, Davis, 226 pp.

Santosa, F., and W. W. Symes, 1988: Computation of the Hessian for least-squares solutions of inverse problems of reflection seismology. *Inverse Problems,* **4,** 211–233.

Santosa, F., and W. W. Symes, 1989: *An Analysis of Least Squares Velocity Inversion.* Geophysical Monogr., Vol. 4, Society of Exploration Geophysicists, 168 pp.

Schlick, T., and A. Fogelson, 1992a: TNPACK—A truncated Newton minimization package for large-scale problems: I. Algorithm and usage. *ACM Trans. Math. Software,* **18,** 46–70.

Schlick, T., and A. Fogelson, 1992b: TNPACK—A truncated Newton minimization package for large-scale problems: II. Implementation examples. *ACM Trans. Math. Software,* **18,** 71–111.

Sleijpen, G. L. G., and H. A. van der Vorst, 1996: A Jacobi–Davidson iteration method for linear eigenvalue problems. *SIAM J. Matrix Anal.,* **17A,** 401–425.

Symes, W. W., 1990: Velocity inversion: A case study in infinite-dimensional optimization. *Math. Program.,* **48,** 71–102.

Symes, W. W., 1991: A differential semblance algorithm for the inverse problem of reflection seismology. *Comput. Math. Appl.,* **22** (4/5), 147–178.

Symes, W. W., 1993: A differential semblance algorithm for the inversion of multioffset seismic reflection data. *J. Geophys. Res.,* **98** (B2), 2061–2073.

Thacker, W. C., 1989: The role of the Hessian matrix in fitting models to measurements. *J. Geophys. Res.,* **94,** 6177–6196.

Thepaut, J-N., and P. Moll, 1990: Variational inversion of simulated TOVS radiances using the adjoint technique. *Quart. J. Roy. Meteor. Soc.,* **116,** 1425–1448.

Thepaut, J-N., and P. Courtier, 1991: Four-dimensional variational assimilation using the adjoint of a multilevel primitive equation model. *Quart. J. Roy. Meteor. Soc.,* **117,** 1225–1254.

Veerse, F., 1999: Variable-storage quasi-Newton operators as inverse forecast/analysis error covariance matrices in data assimilation. INRIA Tech. Rep. 3685, 28 pp.

Wang, Z., 1993: Variational data assimilation with 2-D shallow water equations and 3-D FSU global spectral models. Tech. Rep. FSU-SCRI-93T-149, The Florida State University, Tallahassee, FL, 235 pp.

Wang, Z., I. M. Navon, F. X. Le Dimet, and X. Zou, 1992: The second order adjoint analysis: Theory and application. *Meteor. Atmos. Phys.,* **50,** 3–20.

Wang, Z., I. M. Navon, and X. Zou, 1993: The adjoint truncated Newton algorithm for large-scale unconstrained optimization. Tech. Rep. FSU-SCRI-92-170, The Florida State University, Tallahassee, FL, 44 pp.

Wang, Z., I. M. Navon, X. Zou, and F. X. Le Dimet, 1995: A truncated-Newton optimization algorithm in meteorology applications with analytic Hessian/vector products. *Comput. Optim. Appl.,* **4,** 241–262.

Wang, Z., K. K. Droegemeier, L. White, and I. M. Navon, 1997: Application of a new adjoint Newton algorithm to the 3D ARPS storm-scale model using simulated data. *Mon. Wea. Rev.,* **125,** 2460–2478.

Wang, Z., K. K. Droegemeier, and L. White, 1998: The adjoint Newton algorithm for large-scale unconstrained optimization in meteorology applications. *Comput. Optim. Appl.,* **10,** 283–320.

Yang, W., I. M. Navon, and P. Courtier, 1996: A new Hessian preconditioning method applied to variational data assimilation experiments using an adiabatic version of the NASA/GEOS-1 GCM. *Mon. Wea. Rev.,* **124,** 1000–1017.

Zou, X., I. M. Navon, F. X. Le Dimet, A. Nouailler, and T. Schlick, 1990: A comparison of efficient large-scale minimization algorithms for optimal control applications in meteorology. Tech. Rep. FSU-SCRI-90-167, The Florida State University, Tallahassee, FL, 44 pp.

Zou, X., I. M. Navon, and F. X. Le Dimet, 1992: Incomplete observations and control of gravity waves in variational data assimilation. *Tellus,* **44A,** 273–296.

Zou, X., I. M. Navon, M. Berger, P. K. H. Phua, T. Schlick, and F. X. Le Dimet, 1993: Numerical experience with limited-memory quasi-Newton methods and truncated Newton methods. *SIAM J. Optim.,* **3,** 582–608.

# APPENDIX

## Computational Complexity of AD Calculation of the Second-Order Adjoint

The computational cost of evaluating a function *F* is estimated as

**w**^{T}work(*F*),

where work(*F*) counts the memory accesses and the arithmetic operations performed during the evaluation of *F.*

Here **w** is a vector of positive weights that depend on the computing system and represent the number of clock cycles needed for fetching and/or storing data items, multiplication, addition, and finally for taking into account nonlinear operations.

The weight vector **w**^{T} assumes the form **w**^{T} = (*μ,* *π,* 1, *ν*), with the weight of an addition normalized to one and *μ* ≥ max(1, *π*/2), *π* ≤ 1, and *ν* ≤ 2*π.* For example, this assumption implies that a memory access (*μ*) is at least as slow as an addition or half a multiplication (*π*). Griewank (2000) derives the computational complexity of the tangent model (directional derivative) *w*_{tang}, of the gradient (first-order adjoint) *w*_{grad}, and of the second-order adjoint *w*_{SOA}, each normalized by the complexity of the model evaluation.

As mentioned by Nocedal and Wright (1999), automatic differentiation has increasingly been using more sophisticated techniques that allow, when used in reverse mode, for the calculation of either full Hessians or Hessian/vector products. However, automatic differentiation should not be regarded as a fail-safe product that substitutes for the user's judgment, and each derivative calculation obtained with AD should be carefully assessed.

Gay (1996) has shown how to use partial separability of the Hessian in AD while Powell and Toint (1979) and Coleman and More (1984), along with Coleman and Cai (1986), have shown how to estimate a sparse Hessian using either graph-coloring techniques or other highly effective schemes.

Software for the estimation of sparse Hessians is available in the work of Coleman et al. (1985a,b). See also the work of Dixon (1991) and the general presentation of Gilbert (1992).

Averbukh et al. (1994) supplemented the work of Moré et al. (1981), which provides function and gradient subroutines of 18 test functions for multivariate minimization. Their supplementary Hessian segments enable users to test optimization software that requires second derivative information.
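The graph-coloring idea can be sketched for a tridiagonal Hessian (a toy quadratic cost with invented names): columns whose indices differ by three never overlap within a row, so they can share one perturbation direction, and three gradient differences recover the whole matrix.

```python
import numpy as np

n = 9
# Tridiagonal SPD matrix standing in for a sparse Hessian
T = (np.diag(np.full(n, 4.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))

def grad(x):
    return T @ x   # gradient of the quadratic cost 0.5 * x^T T x

x0 = np.zeros(n)
h = 1.0e-6
H_est = np.zeros((n, n))
for color in range(3):                      # 3 colors suffice for bandwidth 1
    d = np.zeros(n)
    d[color::3] = 1.0                       # perturb all same-colored columns
    g = (grad(x0 + h * d) - grad(x0)) / h   # one gradient difference
    for j in range(color, n, 3):
        for i in range(max(0, j - 1), min(n, j + 2)):
            H_est[i, j] = g[i]              # unambiguous thanks to the coloring
```

Nine columns are thus estimated with three gradient evaluations instead of nine; the same principle underlies the cited graph-coloring schemes for general sparsity patterns.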

Table 1. First five largest and smallest computed Ritz values of the Hessian matrix and the corresponding relative residuals. The Hessian is evaluated at the initial guess point.

Table 2. First five largest and smallest computed Ritz values of the Hessian matrix and the corresponding relative residuals. The Hessian is evaluated at the computed optimal point.


* This work is dedicated to the late Professor Jacques-Louis Lions in deep homage.