## 1. Introduction

This paper concerns the estimation of unknown forecast and observation error covariance parameters for an atmospheric data assimilation system from observational residuals. The method we present is based on maximum-likelihood covariance parameter estimation as described by Dee (1995). In a companion paper (Dee et al. 1999, hereafter referred to as Part II) we describe three different applications, involving univariate as well as multivariate covariance models, and data from both stationary and moving observing systems.

The simplest example of a covariance parameter required for atmospheric data assimilation is the error standard deviation of rawinsonde height measurements at a fixed pressure level. The traditional method (Gandin 1963; Rutherford 1972) for estimating this parameter computes the least squares fit of an isotropic correlation model to the sample spatial correlations of a collection of *observed-minus-forecast residuals.* These are the differences between the heights reported by the instrument and the corresponding (interpolated) predictions obtained from a short-term forecast. The data typically span a period of 1–3 months, and involve quality-controlled reports from a fixed set of stations. The sample correlations among all pairs of stations can be plotted as a function of separation distance, together with a curve representing a fitted correlation model (see Fig. 1). By extrapolating this curve to the origin one can determine the ratio between the observation and forecast error standard deviations.
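The extrapolation step can be sketched in a few lines. The numbers, the exponential correlation model, and the log-linear fit below are illustrative assumptions, not the data or exact procedure of any particular study:

```python
import math

# Synthetic sample correlations of observed-minus-forecast height residuals,
# tabulated against station separation (km). The values are invented for
# illustration; real ones would come from 1-3 months of station-pair statistics.
pairs = [(100, 0.74), (300, 0.60), (500, 0.49), (800, 0.35), (1200, 0.23)]

# Least squares fit of ln c = ln a - r/L (an assumed exponential correlation
# model; the traditional method uses whichever isotropic model is chosen).
xs = [r for r, _ in pairs]
ys = [math.log(c) for _, c in pairs]
n = len(pairs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

a = math.exp(intercept)   # fitted correlation extrapolated to zero separation
L = -1.0 / slope          # fitted decorrelation length scale (km)

# The spatially uncorrelated part of the residual variance is attributed to
# observation error: a = sigma_f^2 / (sigma_f^2 + sigma_o^2), so
ratio = math.sqrt((1.0 - a) / a)   # sigma_o / sigma_f
print(f"a = {a:.3f}, L = {L:.0f} km, sigma_o/sigma_f = {ratio:.2f}")
```

With these synthetic correlations the intercept lands near 0.8, i.e., roughly four-fifths of the residual variance is attributed to forecast error.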

This procedure can be generalized in several ways. For example, wind error standard deviations can be estimated by separating observed-minus-forecast wind residuals into their longitudinal and transverse components (Daley 1985). Hollingsworth and Lönnberg (1986) and Lönnberg and Hollingsworth (1986) used this technique in their extensive and groundbreaking study of the statistical structure of multivariate forecast errors. They also showed how the same procedure can be used to estimate the vertical correlations of forecast and observation errors from multilevel observed-minus-forecast residuals. Additional applications involving anisotropic, multivariate, and nonseparable covariance models have been described by Thiébaux et al. (1986; 1990), Bartello and Mitchell (1992), and Devenyi and Schlatter (1994).

A different method must be used to study the error characteristics of observations originating from moving platforms such as ships, aircraft, and satellites. These data constitute the vast majority of atmospheric observations and potentially contain a wealth of information about forecast errors as well. The observations represent the *only* real source of information about forecast errors: all other benchmarks, such as verifying analyses, are derived from them. For this reason we feel it is important to develop additional techniques for extracting maximum information from the observations themselves.

The purpose of this paper is to present a flexible and mathematically rigorous covariance parameter estimation method that does not involve a priori restrictions on the nature of either the observing system or the covariance model. We estimate the covariance parameters by maximizing the likelihood function of the parameters given the data. Advantages of the maximum-likelihood method over other estimation methods are that 1) it can incorporate information about the error distributions to the extent that such information is available and 2) it is consistent with current operational atmospheric data assimilation systems, all of which may be regarded as particular implementations of the maximum-likelihood method to the problem of estimating the state of the atmosphere from observations and model information (Lorenc 1986; Cohn 1997). Furthermore, the maximum-likelihood method generates useful information about the accuracy of the parameter estimates along with the estimates themselves.

In this study as well as in Part II we especially try to address the limitations of the method, some of which are intrinsic to the problem at hand. For example, the traditional method described earlier relies on the basic assumption that the spatially uncorrelated component of the residual arises from observation errors, while the spatially correlated component represents forecast errors. This illustrates two fundamental aspects of the problem of estimating the error statistics for one particular source of information by comparing it with another. The first is that the error distributions associated with each source must be sufficiently different in order that their respective parameters can be reliably estimated from the residuals. This is the issue of *parameter identifiability* that we discuss in section 4c: *no* method can produce meaningful estimates of poorly identifiable parameters.

The second fundamental aspect of the problem is that it is necessary to describe each of the error distributions in sufficient detail, in order that the only remaining unknowns are the parameters one wishes to estimate. The simple example described earlier actually involves a myriad of simplifying assumptions about forecast and observation errors, such as their statistical independence, their local homogeneity in space and time, and the isotropy of forecast error correlations. This raises the question of *robustness* of the parameter estimates; that is, the sensitivity of the estimates to the various modeling assumptions involved in the formulation of the problem. For example, it can be seen from Fig. 1 that the estimate of the observation and forecast error standard deviations obtained by extrapolation must depend, to some extent, on the choice of the isotropic model used to represent the forecast error correlations. We discuss this issue in section 4e and devote considerable attention to it in Part II.

The organization of this paper is as follows. In section 2 we define forecast and observation errors and discuss covariance modeling in general terms. Appendix A describes a few different isotropic correlation models that we use here and in Part II. In section 3 we describe the relationship between the covariance models and the data, represented by the observed-minus-forecast residuals. We also consider the possibility of using residuals obtained from two separate observing instruments. The heart of this paper is in section 4, where we discuss in detail the application of the maximum-likelihood method to the problem of estimating any unknown covariance parameters from the data. We briefly refer to the generalized cross-validation method (Wahba and Wendelberger 1980), which is reviewed in appendix B. Section 5 contains concluding remarks.

## 2. Covariance models

Let the *n*-vector **w**^{f}_{k} denote a model forecast of the atmospheric state at time *t*_{k}, and **w**^{t}_{k} the true state at that time. The true state **w**^{t}_{k} is an *n*-vector as well, containing, for example, the true gridpoint values or spectral coefficients. The *forecast error* is then simply

**ϵ**^{f}_{k} = **w**^{f}_{k} − **w**^{t}_{k}. (1)

If the *p*_{k}-vector **w**^{0}_{k} contains the observations at time *t*_{k}, the *observation error* is defined by

**ϵ**^{0}_{k} = **w**^{0}_{k} − **h**_{k}(**w**^{t}_{k}), (2)

where the *p*_{k}-vector function **h**_{k} is the *discrete forward observation operator* (e.g., Cohn 1997), mapping model variables to the data type associated with the instrument.

Forecast and observation errors can be partially described in terms of their means and covariances:

**b**^{f}_{k} ≡ 〈**ϵ**^{f}_{k}〉, **P**^{f}_{k} ≡ 〈(**ϵ**^{f}_{k} − **b**^{f}_{k})(**ϵ**^{f}_{k} − **b**^{f}_{k})^{T}〉, (3)

**b**^{0}_{k} ≡ 〈**ϵ**^{0}_{k}〉, **R**_{k} ≡ 〈(**ϵ**^{0}_{k} − **b**^{0}_{k})(**ϵ**^{0}_{k} − **b**^{0}_{k})^{T}〉, (4)

**X**_{k} ≡ 〈(**ϵ**^{0}_{k} − **b**^{0}_{k})(**ϵ**^{f}_{k} − **b**^{f}_{k})^{T}〉, (5)

where 〈 · 〉 denotes the *ensemble averaging* or *expectation operator,* whose proper definition involves the (typically unknown) joint probability distribution of forecast and observation errors.

The observation operator **h**_{k} and its associated observation error distribution are different for each data type. It is of course possible, and sometimes convenient, to combine all available observations at a time *t*_{k} into a single observation vector **w**^{0}_{k}.

All operational data assimilation systems rely on approximate information about error means and covariances. In practice (3)–(5) are modeled by introducing various simplifying assumptions about the underlying error distributions. For example, the mean errors **b**^{f}_{k} and **b**^{0}_{k} are routinely presumed to vanish, and forecast and observation errors are taken to be mutually uncorrelated: **X**_{k} ≡ 0. This assumption ignores the fact that both types of error depend on the local state of the atmosphere and must therefore be correlated with each other. The most familiar example of state-dependent observation error is representativeness error (Daley 1993); model error is also easily shown to be state dependent [Dee 1995, Eqs. (4) and (5)]. We will mention other assumptions about errors associated with specific data types below and in Part II. Some are not necessarily realistic, but in practice the information required to remove them is usually lacking.

In general, theoretical statements about the error distributions combined with practical considerations lead to specific *covariance models* for the forecast and observation errors. Typically such models involve unknown parameters, which must then be estimated from actual atmospheric data.

For example, quality-controlled rawinsonde observations are usually regarded as unbiased measurements of the true atmospheric state. Measurement errors associated with separate vertical soundings are assumed independent, and the errors for the different measurement variables (temperature, relative humidity, and wind components) are assumed to be independent as well. The statistical properties of the errors in all individual, univariate soundings are generally taken to be identical; that is, independent of time and station.

These assumptions lead to an observation error covariance model of the form

**R**^{(mn)}_{ij} = *σ*^{(m)}*σ*^{(n)}*ν*^{(mn)}*δ*(*r*_{ij}), (6)

where **R**^{(mn)}_{ij} denotes the covariance of the error at station *i,* level *m,* with the error at station *j,* level *n.* The parameter *σ*^{(m)} is the observation error standard deviation at pressure level *m,* and *ν*^{(mn)} is the vertical correlation between errors at levels *m* and *n.* The quantity *r*_{ij} is the (horizontal) distance between stations *i* and *j,* and

*δ*(*r*) = 1 if *r* = 0, *δ*(*r*) = 0 otherwise. (7)

Thus, a complete univariate rawinsonde observation error covariance model for each measurement variable is determined by the set of parameters {*σ*^{(m)}, *ν*^{(mn)}}.
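A model of this form is straightforward to assemble. The sketch below builds the observation error covariance matrix for a few stations and two pressure levels; all parameter values and station coordinates are illustrative assumptions, and `delta` encodes the horizontally uncorrelated structure described above:

```python
# Hypothetical parameter values for two pressure levels (m = 0, 1); the
# sigma and nu numbers are illustrative, not estimates from any dataset.
sigma = [12.0, 15.0]                # obs error std dev per level
nu = [[1.0, 0.4], [0.4, 1.0]]       # vertical error correlations nu^(mn)
stations = [(0.0, 0.0), (5.0, 3.0), (9.0, -2.0)]   # station coordinates

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def delta(r):
    # horizontal correlation of rawinsonde errors: 1 in the same column, else 0
    return 1.0 if r == 0.0 else 0.0

# order entries as (station i, level m); R is block diagonal per station
idx = [(i, m) for i in range(len(stations)) for m in range(len(sigma))]
R = [[sigma[m] * sigma[n] * nu[m][n] * delta(dist(stations[i], stations[j]))
      for (j, n) in idx] for (i, m) in idx]
```

The resulting 6 × 6 matrix is symmetric, has the level variances on the diagonal, and couples only errors within the same sounding.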

The nature of forecast errors is more complicated, primarily because model errors are inherently multivariate and correlated in space and time. By expressing these properties in a forecast error covariance model, the information contained in a set of localized, univariate observations can be exploited to estimate multivariate atmospheric fields, even in regions where no observations exist. In order for such estimates to be meaningful, it is of course necessary that the covariance model formulations be sufficiently realistic. Forecast error covariance modeling is an active field of research that we will not attempt to review here. Rather, given a particular formulation of a forecast error covariance model, our concern in this work is to determine the best set of parameters for the model based on the available observations.

For univariate forecast error covariances we consider models of the form

**P**^{(mn)}_{ij} = *σ*^{(m)}*σ*^{(n)}*ν*^{(mn)}*ρ*(*r*_{ij}; *L*^{(m)}, *L*^{(n)}), (8)

where the function *ρ*(*r*_{ij}; *L*^{(m)}, *L*^{(n)}) represents spatial correlations, parameterized by the decorrelation length scales *L*^{(m)} and *L*^{(n)}, while the quantity *ν*^{(mn)}*ρ*(0; *L*^{(m)}, *L*^{(n)}) is the correlation between errors at locations at levels *m* and *n* but in the same vertical column. Such models may be called *quasi-separable* because the vertical correlations are invariant with respect to translation along levels of constant pressure.

First consider the restriction of the model (8) to a fixed pressure level *m,* where it depends on two parameters only: the error standard deviation *σ*^{(m)} and a decorrelation length scale *L*^{(m)}. The function *ρ*(*r*; *L*^{(m)}) ≡ *ρ*(*r*; *L*^{(m)}, *L*^{(m)}) represents the (horizontal) isotropic correlations between errors at any two locations on a fixed pressure level. Only a special class of functions of *r* gives rise to a legitimate (i.e., positive-semidefinite) fixed-level covariance model on a spherical surface; appendix A describes a few such functions as well as a definition of the decorrelation length-scale parameter *L.* Isotropic models derive from an assumption that the correlation between errors at any two locations depends only on the distance between the two locations: the isolines of the correlation functions are circular, and the parameter *L* controls the distance between the contours. The isotropic assumption is clearly not valid for actual forecast errors, which generally depend on local properties of the flow. The widespread use of isotropic univariate covariance models in atmospheric data assimilation systems can be explained by the fact that error correlations have traditionally been calculated by averaging data over relatively long periods of time, for example, 1–3 months (Rutherford 1972; Hollingsworth and Lönnberg 1986; Lönnberg and Hollingsworth 1986; Bartello and Mitchell 1992).

Application of the model (8) to errors at two levels *m* and *n* that are not in the same vertical column requires the specification of the function *ρ*(*r*_{ij}; *L*^{(m)}, *L*^{(n)}), *m* ≠ *n.* Special care must be taken in constructing such a function so that its restriction to any fixed pressure level has the prescribed decorrelation length scale, while it still gives rise to a model that is positive-semidefinite on the domain of interest. This problem and its solution are addressed in detail by Gaspari and Cohn (1999). For estimating the vertical correlations between the errors at any two levels *m* and *n* we use the approximation

*ρ*(*r*_{ij}; *L*^{(m)}, *L*^{(n)}) ≈ ½[*ρ*(*r*_{ij}; *L*^{(m)}) + *ρ*(*r*_{ij}; *L*^{(n)})]. (9)

Models of the form (8) will be used in Part II of this study to describe univariate forecast error covariances. With the approximation (9), such a model is completely determined by the parameters {*σ*^{(m)}, *L*^{(m)}, *ν*^{(mn)}} and by the choice of the function *ρ*(*r*; *L*). We will consider a number of alternatives for this function, primarily in order to examine the effect on the parameter estimates of some of the uncertainties inherent in the description of forecast errors. In any case, estimation from regional time series data will, at best, produce parameter estimates that are representative of the actual forecast errors averaged over the space and time domain of the data.
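A minimal sketch of such a model follows. The SOAR-type correlation function, the averaged cross-level correlation, and all parameter values are assumptions made for illustration (appendix A and the approximation (9) define the paper's actual choices):

```python
import math

# Illustrative parameters (not values from this paper): two levels with
# different decorrelation length scales.
sigma = [30.0, 45.0]       # forecast error std dev per level
Lscale = [500.0, 700.0]    # decorrelation length per level (km)
nu = [[1.0, 0.6], [0.6, 1.0]]   # vertical correlations nu^(mn)

def rho(r, L):
    # second-order autoregressive (SOAR) correlation function; rho(0) = 1
    return (1.0 + r / L) * math.exp(-r / L)

def P_entry(r_ij, m, n):
    # cross-level spatial correlation approximated by averaging the two
    # fixed-level correlation functions, one common positive-semidefinite
    # construction for quasi-separable models
    rho_mn = 0.5 * (rho(r_ij, Lscale[m]) + rho(r_ij, Lscale[n]))
    return sigma[m] * sigma[n] * nu[m][n] * rho_mn
```

At zero separation the entry reduces to *σ*^{(m)}*σ*^{(n)}*ν*^{(mn)}, and it decays monotonically with distance as the SOAR function falls off.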

Wind forecast errors can be described in terms of auxiliary scalar error fields. Let *ϵ*^{u}, *ϵ*^{υ} denote the wind error components and define an *error stream function ψ* and *error velocity potential χ.* Then one may write

*ϵ*^{u} = −∂*ψ*/∂*y* + ∂*χ*/∂*x*, *ϵ*^{υ} = ∂*ψ*/∂*x* + ∂*χ*/∂*y*. (10)

Note that the stream function and velocity potential, in the present context, are associated with the error fields rather than with the flow itself. A multivariate wind error covariance model can then be constructed based on separate univariate covariance models for *ψ* and *χ.* The simplest such model results from the assumption that *ψ* and *χ* are statistically independent, and that the covariance of each can be modeled by (8).

For the applications in Part II we will use this simple approach to represent forecast wind error covariances at fixed pressure levels. This model does not provide any information about the cross covariances between wind errors and height errors. The coupling between wind and height errors is known to be strong in midlatitudes; this information must be incorporated in a multivariate forecast error covariance model in order to take full advantage of the observations in a data assimilation system (Hollingsworth and Lönnberg 1986). However, if the goal is only to estimate wind observation error covariance parameters, then a forecast error covariance model based on (10) will serve the purpose.
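The decomposition (10) is easy to exercise numerically. The sketch below generates a wind error field from a random stream function alone (*χ* = 0) using centered differences on a small grid; because the difference operators commute, the resulting field is nondivergent to machine precision. Grid size and spacing are arbitrary choices:

```python
import random

random.seed(1)
N, d = 12, 1.0   # grid size and spacing (illustrative)
# random error stream function psi on an N x N grid; chi = 0, so the
# resulting wind error field should be purely rotational
psi = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(N)]

def ddx(f, i, j):   # centered difference in x (first index)
    return (f[i + 1][j] - f[i - 1][j]) / (2 * d)

def ddy(f, i, j):   # centered difference in y (second index)
    return (f[i][j + 1] - f[i][j - 1]) / (2 * d)

# eps_u = -dpsi/dy, eps_v = +dpsi/dx, i.e., Eq. (10) with chi = 0
u = {(i, j): -ddy(psi, i, j) for i in range(1, N - 1) for j in range(1, N - 1)}
v = {(i, j): ddx(psi, i, j) for i in range(1, N - 1) for j in range(1, N - 1)}

# discrete divergence du/dx + dv/dy in the interior: centered differences
# commute, so it vanishes up to floating-point roundoff
div = [(u[(i + 1, j)] - u[(i - 1, j)]) / (2 * d) +
       (v[(i, j + 1)] - v[(i, j - 1)]) / (2 * d)
       for i in range(2, N - 2) for j in range(2, N - 2)]
max_div = max(abs(x) for x in div)
```

A nonzero *χ* would add the divergent component in exactly the same way, with the roles of the two derivatives exchanged.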

To summarize, we express the forecast and observation error covariance models in parameterized form:

**P**^{f}_{k} = **P**^{f}_{k}(*α*^{f}), **R**_{k} = **R**_{k}(*α*^{0}), **X**_{k} = **X**_{k}(*α*^{x}), (11)

with *α*^{f}, *α*^{0}, and *α*^{x} unknown parameters whose definition depends on the particular modeling assumptions. Our goal will be to determine values for these parameters that, in a sense to be made precise, are most compatible with the data.

## 3. Observational residuals

The primary data consist of *observed-minus-forecast residuals* defined by

**v**_{k} = **w**^{0}_{k} − **h**_{k}(**w**^{f}_{k}). (12)

The *p*_{k}-vector time series {**v**_{k}} depends on actual observation and forecast errors, since

**v**_{k} ≈ **ϵ**^{0}_{k} − **H**_{k}**ϵ**^{f}_{k}, (13)

where the matrix **H**_{k}, a *p*_{k} × *n*-matrix, is defined by

**H**_{k} ≡ ∂**h**_{k}/∂**w**, evaluated at **w** = **w**^{f}_{k}. (14)

Equation (13) is obtained by linearizing (12) about the forecast state and using (1) and (2). The accuracy of (13) depends on the size of the forecast errors; it is exact for linear observation operators.

When two different instruments observe overlapping portions of the atmosphere, it is also possible to form *observed-minus-observed residuals*

**r**_{k} = **w**^{o1}_{k} − **w**^{o2}_{k}, (15)

provided the observation operators **h**^{1}_{k} and **h**^{2}_{k} are compatible. In that case

**r**_{k} ≈ **ϵ**^{o1}_{k} − **ϵ**^{o2}_{k}. (16)

Consider, for example, a set of satellite temperature retrievals at time *t*_{k}. If the retrievals are collocated with a set of rawinsonde temperature observations, then the residuals (15) can be computed. In this case the observation operators, although very different, are compatible. Some kind of interpolation will be required in order to collocate the retrievals with the rawinsonde observations, and therefore (16) is not exact. Provided the interpolation errors are small compared with the observation errors themselves, the observed-minus-observed residuals contain useful information about the observation errors associated with the two data types.

Taking expectations of (13) gives

〈**v**_{k}〉 ≈ **b**^{0}_{k} − **H**_{k}**b**^{f}_{k}, (17)

〈(**v**_{k} − 〈**v**_{k}〉)(**v**_{k} − 〈**v**_{k}〉)^{T}〉 ≈ **R**_{k} + **H**_{k}**P**^{f}_{k}**H**^{T}_{k} − **X**_{k}**H**^{T}_{k} − **H**_{k}**X**^{T}_{k}, (18)

where we used 〈**H**_{k} · 〉 ≈ **H**_{k}〈 · 〉; both (17) and (18) are exact for linear observation operators.

We denote the mean of the residuals **v**_{k} by

*μ*_{k} ≡ **b**^{0}_{k} − **H**_{k}**b**^{f}_{k}, (19)

and, for the time being, regard *μ*_{k} as known; see, however, section 4b below. Inserting the parameterized covariance models (11) into (18) then gives

〈(**v**_{k} − *μ*_{k})(**v**_{k} − *μ*_{k})^{T}〉 ≈ **S**_{k}(*α*), (20)

where *α* denotes the complete vector of covariance parameters,

*α* ≡ (*α*^{f}, *α*^{0}, *α*^{x}), (21)

and

**S**_{k}(*α*) ≡ **R**_{k}(*α*^{0}) + **H**_{k}**P**^{f}_{k}(*α*^{f})**H**^{T}_{k} − **X**_{k}(*α*^{x})**H**^{T}_{k} − **H**_{k}**X**^{T}_{k}(*α*^{x}). (22)

Equation (20) establishes the relationship between the covariance models and the observed-minus-forecast residuals.

Similarly, for the observed-minus-observed residuals (15) we have

〈**r**_{k}〉 ≈ **b**^{o1}_{k} − **b**^{o2}_{k}, (23)

〈(**r**_{k} − 〈**r**_{k}〉)(**r**_{k} − 〈**r**_{k}〉)^{T}〉 ≈ **R**^{1}_{k} − **Y**_{k} − **Y**^{T}_{k} + **R**^{2}_{k}, (24)

where **Y**_{k} is the cross covariance between the observation errors,

**Y**_{k} ≡ 〈(**ϵ**^{o1}_{k} − **b**^{o1}_{k})(**ϵ**^{o2}_{k} − **b**^{o2}_{k})^{T}〉. (25)

Denoting the mean of the residuals **r**_{k} by

*ζ*_{k} ≡ **b**^{o1}_{k} − **b**^{o2}_{k}, (26)

and regarding *ζ*_{k} as known, we then have

〈(**r**_{k} − *ζ*_{k})(**r**_{k} − *ζ*_{k})^{T}〉 ≈ **T**_{k}(*α*), (27)

provided parameterized models are available for the observation error covariances **R**^{1}_{k} and **R**^{2}_{k} and for the cross covariance **Y**_{k} as well. In this case, models for observation error covariances imply a model for the observed-minus-observed residuals; Eq. (27) establishes a relationship between the models and the data analogous with (20).

## 4. Covariance parameter estimation

The previous section describes the relationship between error covariance models and data residuals. We now consider a general method for adjusting the free model parameters in order to improve the consistency between the models and a given finite subset of the data. In order to keep the presentation simple we will use the notation for observed-minus-forecast residuals: the data are **v**_{k} and the covariance model is **S**_{k}(** α**), although the method applies equally well to observed-minus-observed residuals.

### a. Maximum-likelihood estimation

One way to fit a model to a dataset is by maximizing the likelihood that the actual observed data did, in fact, arise from the model. To be precise, suppose that the actual sequence of residuals {**v**_{k}} is a realization of a multivariate stochastic process {**V**_{k}}, whose joint probability density function (pdf) is *p*({**v**_{k}}; *α*). If the functional form of the pdf is known, then its value for a fixed dataset {**v**_{k}, *k* = 1, . . . , *K*} depends on *α* only: the function of *α* thus defined is called the *likelihood function* (Fisher 1922). The *maximum-likelihood estimate* *α̂* is the value of *α* that maximizes the likelihood function.

We now apply this idea to the process {**V**_{k}} with covariances given by (22). To this end we postulate that the process is white and Gaussian, with covariances at times *t*_{k} given by **S**_{k}(*α*) for some *α*. We also assume that the means *μ*_{k} are known, or that they can be estimated independently. Using the familiar expression for the multivariate Gaussian pdf (Jazwinski 1970, section 2.4),

*p*({**v**_{k}}; *α*) = ∏_{k=1}^{K} (2π)^{−p_{k}/2} [det **S**_{k}(*α*)]^{−1/2} exp[−½(**v**_{k} − *μ*_{k})^{T}**S**^{−1}_{k}(*α*)(**v**_{k} − *μ*_{k})]. (30)

The maximum-likelihood estimate *α̂* is obtained by maximizing (30) with respect to *α,* or equivalently, by minimizing the *log-likelihood function*

*f*(*α*) = Σ_{k=1}^{K} [ln det **S**_{k}(*α*) + (**v**_{k} − *μ*_{k})^{T}**S**^{−1}_{k}(*α*)(**v**_{k} − *μ*_{k})], (31)

obtained by taking −2 ln of (30) and discarding constant terms. Note that this expression depends on the data, is therefore random, and that there is no guarantee of a unique minimum.

The assumption that the process {**V**_{k}} is white and Gaussian is not essential; the pdf (30) can be replaced by any other. For example, it is straightforward to incorporate a description of the serial correlations between successive residuals. For some data types a lognormal distribution may be more appropriate, although in the applications we have studied so far the Gaussian assumption appears to be adequate. In any case, the sensitivity of the parameter estimates to assumptions about the pdf of the data should be addressed experimentally; we will return to this issue in section 4e below and in Part II.

For a fixed dataset {**v**_{k}} and given formulations of the covariance models **S**_{k}(*α*), the function *f*(*α*) can be minimized using standard optimization software (e.g., Press et al. 1992, chapter 10). Forecast and observation error covariance models implemented in current atmospheric data assimilation systems are relatively simple to evaluate; they have to be in order for the assimilation of large volumes of data to be computationally viable. The effort involved in tuning models such as (6) and (8) therefore depends primarily on the size of the dataset. As we show below, the error variance in the parameter estimates is proportional to 1/*ν*, where *ν* is the total number of data. The constant of proportionality varies from case to case (it depends on the identifiability of the parameters), but as a rule it requires on the order of a hundred observations to estimate a single parameter with meaningful accuracy.

The log-likelihood function (31) is formulated for the general case of time-dependent forecast and/or observation error covariances; see (22). This is necessary, for example, when the observation operator depends on time, which is the case for observations originating from moving platforms such as ships, aircraft, or satellites. Time-dependent covariance models **S**_{k}(*α*) can also result from expressing forecast error covariances in terms of local atmospheric flow and/or thermodynamic conditions; see Riishøjgaard (1998) for an example of such a model.
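As a minimal illustration of such a minimization, the sketch below fits a single variance parameter to scalar residuals by golden-section search on a log-likelihood function of the form (31); the synthetic data, the one-parameter model **S**_{k}(*α*) = *α*, and the search interval are assumptions made for this example:

```python
import math
import random

random.seed(2)
# Synthetic scalar residuals with true variance 2.0 (illustrative setup:
# one scalar observation per time, mu_k = 0, model S_k(alpha) = alpha)
true_var = 2.0
v = [random.gauss(0.0, math.sqrt(true_var)) for _ in range(500)]

def f(alpha):
    # log-likelihood function: sum over k of ln det S_k + v_k^2 / S_k
    return sum(math.log(alpha) + vk ** 2 / alpha for vk in v)

# crude one-dimensional minimization by golden-section search on [0.5, 5.0]
lo, hi = 0.5, 5.0
g = (math.sqrt(5.0) - 1.0) / 2.0
for _ in range(60):
    a, b = hi - g * (hi - lo), lo + g * (hi - lo)
    if f(a) < f(b):
        hi = b
    else:
        lo = a
alpha_hat = 0.5 * (lo + hi)

# for this simple model the minimizer is known in closed form:
# the mean squared residual
closed_form = sum(vk ** 2 for vk in v) / len(v)
```

In realistic applications *f* has several parameters and no closed-form minimizer, which is where general-purpose optimization software comes in.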

If the residual time series is stationary, so that

**S**_{k}(*α*) = **S**(*α*), *μ*_{k} = *μ*, (32)

then the log-likelihood function (31) becomes

*f*(*α*) = *K*[ln det **S**(*α*) + tr **S**^{−1}(*α*)**Ŝ**], (33)

where **Ŝ** is the sample residual covariance matrix

**Ŝ** ≡ (1/*K*) Σ_{k=1}^{K} (**v**_{k} − *μ*)(**v**_{k} − *μ*)^{T}. (34)

If the minimization of (33) is carried out over all symmetric positive-definite matrices **S**, then the minimum is attained at **S** = **Ŝ**: the sample covariance matrix is the *unconstrained* maximum-likelihood covariance estimate for a white, stationary Gaussian time series (e.g., Muirhead 1982). Minimization of (33) provides instead the *constrained* maximum-likelihood covariance estimate—constrained to be of the form **S**(*α*) for some *α*. This approach to the estimation of structured covariance matrices from stationary time series was first proposed by Burg et al. (1982).

The stationary form (33) of the log-likelihood function is appropriate for estimating covariance parameters from station data. Most covariance models currently implemented in operational data assimilation systems are independent of the state of the system and generally vary slowly with time, if at all. In this case it is convenient to first compute the sample covariance matrix (34), and then to minimize (33) with respect to *α*. This procedure is similar to the traditional method described in the introduction, except that the parameter estimates are based on the maximum-likelihood criterion rather than on a least squares fit. The two norms are different in general since the least squares procedure does not take into account any a priori information about the probability distribution of the data.

In practice, station reports are occasionally missing, so that the residual time series is incomplete. The sample covariance associated with stations *i* and *j* can then be estimated from this set by

**Ŝ**_{ij} = (1/*K*_{ij}) Σ′ [**v**_{k} − *μ*]_{i}[**v**_{k} − *μ*]_{j}, (35)

where [ · ]_{i} denotes the element associated with station *i,* the sum Σ′ extends over all times at which both stations report, and *K*_{ij} is the number of simultaneous reports at stations *i* and *j.* If *K*_{ij} is small then **Ŝ**_{ij} is generally not an accurate estimate of the covariance between stations *i* and *j.* One might exclude a number of stations in order to ensure that all *K*_{ij} exceed a certain threshold; however, our experience with the maximum-likelihood method is that the parameter estimates are rather insensitive to this.

### b. Bias estimation

In order to implement the estimation procedure, it is necessary to specify the residual means *μ*_{k} in (31) or in (34). These depend on the mean observation errors **b**^{0}_{k} and the mean forecast errors **b**^{f}_{k}, neither of which is known in practice.

There are two choices for the purpose of tuning covariance models. The first is to simply ignore the bias by taking *μ*_{k} = 0. This choice will, of course, affect the parameter estimates. For example, variance parameters will tend to be overestimated when bias is ignored. This approach is not unreasonable if the tuned covariance models are to be used for a statistical analysis where bias is not explicitly accounted for. In that case the total (systematic plus random) root-mean-square analysis error will actually be smallest when the forecast and observation error variances are suitably inflated in order to account for the bias (Dee and da Silva 1998).

The alternative is to estimate the residual mean *μ*_{k} prior to, or concurrent with, the estimation of covariance parameters. If independent information about forecast and/or observation bias is available, then this information should obviously be used. In practice this is unlikely to be the case and therefore *μ*_{k} must be estimated from the data.

A general approach is to formulate a parameterized model for the residual means,

*μ*_{k} = *μ*_{k}(*β*), (36)

and to estimate the bias parameters *β* and the covariance parameters *α* jointly by the maximum-likelihood method. Estimating bias parameters in this fashion amounts to a weighted least squares bias estimation procedure in which the weights (determined by the covariances) are adjusted adaptively. Although the generality of this approach is appealing, we do not expect it to be practical. The difficulty in bias estimation lies not so much in the techniques as in the ability to formulate sensible bias models.

A simpler alternative is to estimate *μ*_{k} by calculating the time-mean residuals. In the stationary case (i.e., for stationary observing systems) we then have *μ*_{k} = *μ,* with elements

[*μ*]_{i} = (1/*K*_{i}) Σ′ [**v**_{k}]_{i}, (37)

where the sum extends over all times at which station *i* reports, with *K*_{i} the number of reports from station *i.* In the general (nonstationary) case when observation locations vary with time, one might define a spatially varying bias estimate *μ* by averaging the residuals locally in space and time.

The presence of bias that is not properly accounted for will generally result in inaccurate covariance parameter estimates. More importantly, biased data and/or forecasts will result in a biased analysis, independently of the covariance models used. Although it is not the subject of this study, bias estimation and correction must take precedence over covariance modeling and estimation.

### c. Identifiability of the parameters

The maximum-likelihood method is appealing for its generality; it can be used to estimate any set of parameters of the pdf, provided those parameters are *jointly identifiable.* This is a fundamental requirement for the estimation problem to be well posed. There exist different technical definitions of this notion (e.g., Chavent 1979), but we will take it to mean that the log-likelihood function (31) must have a unique global minimum with probability one; that is, for almost all realizations of the process {**V**_{k}}. In practice this imposes requirements on the model formulation as well as on the data: there must be no dependency among the model parameters, and the data must provide an adequate sampling.

Consider, for example, two independent *scalar* random variables *w*^{1} and *w*^{2} with identical means but different variances, and suppose that only the residual *w*^{1} − *w*^{2} is observed. It is clearly impossible to estimate the variances of *w*^{1} and *w*^{2} separately, no matter how many sample residuals are available: only the sum of the variances can be estimated. (The means of *w*^{1} and *w*^{2} are not identifiable either in this case.) Suppose now that **w**^{1} and **w**^{2} are independent *vector* random variables, representing, for example, two spatially distributed random fields. If **w**^{2} is spatially correlated with constant variance but **w**^{1} is spatially uncorrelated with constant variance, then it is, in fact, possible to estimate both variances from the residuals **w**^{1} − **w**^{2}, provided they are sampled at more than a single location. Thus, there is a data requirement as well as a model requirement. This example is prototypical for the applications described in Part II, where the data residuals contain both a spatially correlated and a spatially uncorrelated error component.
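This scalar example can be made concrete. In the sketch below (synthetic data; the Gaussian log-likelihood form is the same one used in section 4a), the fit depends on the two candidate variances only through their sum, so every pair with the same sum yields exactly the same cost:

```python
import math
import random

random.seed(3)
# residuals of two independent scalar variables with equal means:
# d = w1 - w2 has variance s1 + s2, so only the SUM is identifiable
d = [random.gauss(0.0, math.sqrt(1.0 + 2.0)) for _ in range(200)]

def f(s1, s2):
    # Gaussian log-likelihood function of the residuals for candidate
    # variances s1 and s2; it depends on (s1, s2) only through s1 + s2
    s = s1 + s2
    return sum(math.log(s) + dk ** 2 / s for dk in d)

# every (s1, s2) with the same sum gives exactly the same fit ...
vals = [f(0.5, 2.5), f(1.5, 1.5), f(2.0, 1.0)]
spread = max(vals) - min(vals)
# ... so the log-likelihood surface is flat along s1 + s2 = const, and the
# individual variances cannot be recovered, no matter how many residuals
```

The flat direction shows up as a zero (or near-zero) eigenvalue of the Hessian discussed below; different sums, by contrast, are clearly distinguished by the data.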

Identifiability can be checked locally by examining the behavior of the log-likelihood function *f.* At the minimum the gradient of the log-likelihood function vanishes, so for *α* near *α̂* we have

*f*(*α*) ≈ *f*(*α̂*) + ½(*α* − *α̂*)^{T}**A**(*α* − *α̂*), (41)

where **A** is the Hessian of *f* evaluated at the minimum *α̂*, provided *f* is a sufficiently smooth function of *α.*

*α*Equation (41) shows that the sensitivity of the log-likelihood function *f* to the parameters ** α** near its minimum is controlled by the Hessian. A small perturbation of

*α̂***A**

*f*by an amount proportional to the corresponding eigenvalue (e.g., Strang 1988, section 7.2). If the Hessian has a large condition number then the identifiability of the parameters is poor along the directions associated with the smallest eigenvalues.

Since the function *f* is generally not quadratic, the Hessian matrix **A** measures only the *local* identifiability of the parameters. Note, however, that the analysis does not depend on any properties of *f* other than its differentiability with respect to the parameters. Identifiability is a notion that is not specifically connected with the maximum-likelihood method; it is simply a practical requirement for any parameter estimation method that is based on minimizing a cost function.

### d. Accuracy of maximum-likelihood parameter estimates

The maximum-likelihood method has many appealing theoretical properties (Cramér 1946); in particular it is *asymptotically efficient.* This means that in the limit of infinite data there is no other unbiased estimator that produces more accurate parameter estimates. In practice, we have only finite datasets at our disposal, and more importantly, many of the assumptions required to implement the method are, in fact, violated. The parameter estimates produced in any realistic application therefore will *not* be true maximum-likelihood estimates. Nevertheless, it is useful to compute the asymptotic accuracy of the maximum-likelihood estimates along with the parameter estimates themselves; as we explain below, this provides important information about the uncertainty of the estimates due to sampling error.

Suppose first that the *model hypothesis* holds. By this we mean that all assumptions about the data as expressed by the likelihood function are actually valid. In that case the parameter estimates produced by minimizing (31) are truly the maximum-likelihood estimates. Then it can be shown (e.g., Sorenson 1980, his Theorem 5.4) that, if *α** is the vector of true parameter values, then in the limit of infinite data the estimates approach a normal distribution with mean *α** and covariance

〈(*α̂* − *α**)(*α̂* − *α**)^{T}〉 = (2/*ν*)**A**^{−1}, (42)

where **A** is the Hessian of the log-likelihood function at the minimum, normalized by the number of data, and *ν* is the number of degrees of freedom associated with the estimation problem. In the general (nonstationary) case corresponding to (31) we have

*ν* = Σ_{k=1}^{K} *n*_{k}, (43)

with *n*_{k} the number of observations at time *t*_{k}. In the truly stationary case where *n*_{k} = *n* = const, we would have *ν* = *nK*; however, (43) should be applied to account for missing data [see the discussion following (35)]. If bias parameters are estimated from the same data used for covariance estimation, then the number of degrees of freedom *ν* should be reduced accordingly.

The estimation error covariance in (42) is the lower bound of the Cramér–Rao inequality (Sorenson 1980). The Cramér–Rao inequality can be regarded as an *uncertainty principle* for parameter estimation: it expresses the fact that the random nature of the data imposes a fundamental limitation on the accuracy with which parameters of the pdf can be estimated from the data. The Hessian matrix **A** is closely related to the Fisher information matrix, which measures the information content of the data with respect to the parameters.

We routinely use (42) to estimate the standard errors of the parameter estimates under the modeling hypothesis. For any given dataset and covariance model formulation, the validity of (42) for finite *ν* can be checked using a Monte Carlo experiment with synthetic data. The procedure is simply to compute parameter estimates from the output of a random-number generator, making sure that the generator produces normally distributed vector samples whose covariances are given by the covariance model with specified parameters. By repeating this many times, the actual estimation errors due to sampling error can be computed. We performed this procedure for all applications described in Part II; although not reported there, the standard error estimates given by (42) turned out to be quite accurate in all cases.
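The Monte Carlo check described above can be sketched in a few lines. The exponential correlation model, the station geometry, and all sample sizes below are hypothetical stand-ins for an actual covariance model formulation, and the crude grid search stands in for a proper minimization of the log-likelihood cost function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary covariance model: variance * exp(-d_ij / length),
# standing in for a parameterized covariance model S(alpha).
def model_cov(variance, length, dists):
    return variance * np.exp(-dists / length)

# Pairwise distances for n stations placed on a line (illustrative only).
n = 20
x = np.linspace(0.0, 10.0, n)
dists = np.abs(x[:, None] - x[None, :])

true_var, true_len = 2.0, 3.0
S_true = model_cov(true_var, true_len, dists)
chol = np.linalg.cholesky(S_true)

def neg_log_likelihood(params, samples):
    var, length = params
    S = model_cov(var, length, dists)
    _, logdet = np.linalg.slogdet(S)
    Sinv = np.linalg.inv(S)
    # Gaussian negative log-likelihood (constant terms dropped).
    return sum(logdet + v @ Sinv @ v for v in samples)

def ml_estimate(samples):
    # Crude grid search; a real implementation would use a quasi-Newton method.
    grid_v = np.linspace(1.0, 4.0, 21)
    grid_l = np.linspace(1.0, 6.0, 21)
    best = min(((neg_log_likelihood((v, l), samples), (v, l))
                for v in grid_v for l in grid_l), key=lambda t: t[0])
    return best[1]

# Repeat the estimation over many synthetic datasets of K residual vectors,
# then inspect the scatter of the estimates (the sampling error).
K, trials = 50, 10
estimates = []
for _ in range(trials):
    samples = [chol @ rng.standard_normal(n) for _ in range(K)]
    estimates.append(ml_estimate(samples))
estimates = np.array(estimates)
print("mean estimates:", estimates.mean(axis=0))  # should be close to (2.0, 3.0)
print("std (sampling error):", estimates.std(axis=0))
```

The empirical standard deviations computed this way are what the asymptotic standard errors are checked against.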

The maximum-likelihood standard errors represent idealized accuracy estimates; in practice they should be regarded as lower bounds on the true accuracy. These error estimates are useful in practice because they quantify the effect of sampling error, which is the only source of error under the model hypothesis. Thus, the standard error estimates indicate whether a given set of covariance parameters can actually be identified from the available data, and whether the parameter uncertainty due to sampling error is acceptable. Making sure that sampling error is small allows one to investigate other sources of uncertainty by modifying the assumptions underlying the model hypothesis. We include several examples of this type of uncertainty analysis in Part II.

### e. Robustness of the parameter estimates

The fact that parameters associated with a particular covariance model can be estimated from a dataset (i.e., that they are identifiable) does not imply that the estimates are actually meaningful. There are many possible reasons why a tuned covariance model may not in fact provide a good fit to the actual data. First of all, the covariance model may be incorrect for *any* set of parameter values. For example, the model might be isotropic while actual covariances are highly anisotropic: a strong state-dependent component of error cannot be accounted for in an isotropic model. To some extent the validity of the assumptions that enter into the formulation of a covariance model can be examined using standard statistical techniques. This requires long-term monitoring of the actual residuals produced by an operational data assimilation system.

A second group of possible reasons for a poor fit concerns the additional assumptions involved in tuning the model. Even if the covariance model is appropriate, the parameter estimates may be far from optimal because, for example, the bias is handled incorrectly, or the data may be serially correlated. In fact it is very likely that some, if not all, of these violations apply in practice. Yet the maximum-likelihood method depends on a complete specification of the pdf of the data. In the absence of information, it is common practice to default to a standard set of assumptions. For example, lacking any specific indications to the contrary, it is almost always assumed that the data are Gaussian and white. This raises the issue of *robustness* of the maximum-likelihood method with respect to the information it requires; we will address this issue experimentally in Part II.

It is worth noting at this point that all currently operational atmospheric data assimilation systems can be regarded as particular applications of the maximum-likelihood method to the problem of estimating the state of the atmosphere from observations and model information (Lorenc 1986; Cohn 1997). Different assumptions about the underlying probability distributions lead to different solution methods, but in all cases that have been tried so far the errors (after quality control of observations) are assumed to be Gaussian and white. In the present work we try to be consistent in applying this same framework to the estimation of parameters of the covariance models, although the maximum-likelihood method is completely general in this respect.

Let us take the pragmatic point of view, then, that the majority of assumptions about error distributions are made primarily for practical reasons, and not necessarily because they are believed to be valid. Then the log-likelihood function (31) [or (33), in the stationary case] is simply one of many possible cost functions that could be used for fitting a parameterized family of covariance models to a dataset. The traditional fitting procedure for station data described in the introduction, for example, is based on a least squares criterion. An advantage of the maximum-likelihood approach is that it incorporates the same statistical information about forecast and observation errors used in the data assimilation system. However, if the underlying assumptions on the error distributions are wrong, one might legitimately ask whether there are any other criteria that lead to a more robust parameter estimation procedure.

One such alternative is provided by generalized cross-validation (GCV; Wahba and Wendelberger 1980; Wahba et al. 1995). Consider a covariance model of the form

**S**(**α**) = **S**(*σ*_{1}, *σ*_{2}, **θ**) = *σ*^{2}_{1}**S**_{1} + *σ*^{2}_{2}**S**_{2}(**θ**), (44)

with **S**_{1} a constant matrix, and both **S**_{1} and **S**_{2} positive definite. Such a model is sufficiently general for most applications of interest. The first term typically represents observation error covariances, while the second term can be used to model forecast error covariances. The GCV method estimates the parameter *λ* defined by

*λ* = *σ*^{2}_{1}/*σ*^{2}_{2},

and possibly additional parameters **θ** as well, from data residuals. We summarize the GCV estimation procedure in appendix B.

The scalar *λ* is actually the single most important parameter of the covariance model (44), being the ratio of the variances of the two signals present in the data residuals. Estimates of the separate variances *σ*^{2}_{1} and *σ*^{2}_{2} are obtained as a by-product of the GCV estimation procedure. Note that the identifiability requirement still holds; *no* method can produce meaningful estimates of poorly identifiable parameters. See appendix A of Wahba et al. (1995) for a discussion of identifiability in the context of GCV.

The GCV approach as formulated by Wahba et al. (1995) applies to the estimation of the parameters *σ*_{1}, *σ*_{2}, and **θ** based on a single vector of residuals valid at a fixed time *t*_{k}. As in Dee (1995), their method was originally intended to be used online in a data assimilation system, for the adaptive tuning of system parameters. However, it can also be applied offline for covariance estimation based on data {**v**_{k}} collected over a finite interval, as we show in appendix B. From a practical point of view the GCV method and the maximum-likelihood method differ only in that they involve different cost functions; compare (B10) with (31) [or, for stationary models, (B15) with (33)]. For both methods, the parameter estimates change when either the covariance model formulation or the data change. The two methods are also comparable in computational complexity, although we have found that locating a minimum of the GCV cost function is occasionally easier when starting from a poor initial guess. The parameter estimates obtained by the two methods from the same dataset can be compared by simply interchanging the cost functions; experiments reported in Part II indicate that the differences are mostly insignificant.
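The kinship between the two cost functions can be illustrated on a toy instance of model (44). The choices below — **S**_{1} = **I**, a fixed exponential **S**_{2}, a single parameter *λ* with the overall scale profiled out of the likelihood, and a simple grid scan — are illustrative simplifications, not the formulations used in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 30, 40

# Toy instance of model (44): S = sig1^2*S1 + sig2^2*S2, with S1 = I
# (observation error) and S2 a fixed correlation matrix (forecast error).
x = np.linspace(0, 10, n)
S2 = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)

sig1_true, sig2_true = 1.0, 2.0          # lambda_true = sig1^2/sig2^2 = 0.25
S_true = sig1_true**2 * np.eye(n) + sig2_true**2 * S2
Lc = np.linalg.cholesky(S_true)
V = [Lc @ rng.standard_normal(n) for _ in range(K)]

def ml_cost(lam):
    # Negative log-likelihood with the overall scale sig2^2 profiled out:
    # S = sig2^2 * (S2 + lam*I).
    M = S2 + lam * np.eye(n)
    Minv = np.linalg.inv(M)
    _, logdet = np.linalg.slogdet(M)
    scale = sum(v @ Minv @ v for v in V) / (n * K)
    return n * K * np.log(scale) + K * logdet

def gcv_cost(lam):
    # Multiple-sample GCV function; with S1 = I the whitening is trivial.
    A = np.linalg.inv(np.eye(n) + lam * np.linalg.inv(S2))
    IA = np.eye(n) - A
    num = sum(np.sum((IA @ v)**2) for v in V) / (n * K)
    den = (np.trace(IA) / n)**2
    return num / den

lams = np.linspace(0.05, 1.0, 60)
lam_ml = lams[int(np.argmin([ml_cost(l) for l in lams]))]
lam_gcv = lams[int(np.argmin([gcv_cost(l) for l in lams]))]
print("true lambda = 0.25, ML estimate:", lam_ml, "GCV estimate:", lam_gcv)
```

Both criteria, scanned over the same grid, place their minima near the true variance ratio; they differ only in the cost function being minimized.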

## 5. Summary and conclusions

We described the application of the maximum-likelihood method to the estimation of unknown observation and forecast error covariance parameters from observational residuals. Issues such as bias estimation and correction, parameter identifiability, estimation accuracy, and robustness of the method were addressed with an eye toward practical application. We briefly discussed the relationship between the maximum-likelihood method and generalized cross-validation (Wahba and Wendelberger 1980). In Part II of this study we describe three different applications involving univariate and multivariate covariance models, and data from both stationary and moving observing systems.

Advantages of the maximum-likelihood method include the fact that it allows one to use a priori information about the error distributions, to the extent that such information is available. The method is consistent with current operational atmospheric data assimilation systems, all of which can be regarded as particular implementations of the maximum-likelihood method applied to the problem of estimating the state of the atmosphere from observations and from model information. We showed in section 4d how the maximum-likelihood method can be used to produce estimates of the effect of sampling error upon parameter uncertainty. By making sure that this effect is small, one can study the variability of the covariance parameters by changing the selection of data. In addition, it is possible to perform an uncertainty analysis on the parameter estimates by modifying the assumptions that enter into the maximum-likelihood formulation.

A fundamental limitation of this and other estimation methods is connected with the identifiability of the covariance parameters. Simultaneous estimation of multiple parameters is possible only when all parameters are jointly identifiable from the data. This imposes requirements on the model formulation as well as on the data. In practice, observation errors and forecast errors can be statistically separated only to the extent that 1) they have characteristics that are distinguishable in the observation space; and 2) these different characteristics are adequately modeled. For example, if the observation error contains a spatially correlated component, then this component may be falsely attributed to forecast error. In a future paper we will present a strategy for dealing with such a situation.

## Acknowledgments

Thanks to Steve Cohn, Roger Daley, Greg Gaspari, Peter Houtekamer, Herschel Mitchell, Chris Redder, Leonid Rukhovets, and Grace Wahba for many stimulating discussions about this work.

## REFERENCES

Bartello, P., and H. L. Mitchell, 1992: A continuous three-dimensional model of short-range forecast error covariances. *Tellus,* **44A,** 217–235.

Burg, J. P., D. G. Luenberger, and D. L. Wenger, 1982: Estimation of structured covariance matrices. *Proc. IEEE,* **70,** 963–974.

Chavent, G., 1979: Identification of distributed parameter systems: About the output least square method, its implementation, and identifiability. *Identification and System Parameter Estimation: Proceedings of the Fifth IFAC Symposium,* R. Iserman, Ed., Vol. 1, Pergamon Press, 85–97.

Cohn, S. E., 1997: Introduction to estimation theory. *J. Meteor. Soc. Japan,* **75,** 257–288.

Cramér, H., 1946: *Mathematical Methods of Statistics.* Princeton University Press, 575 pp.

Daley, R., 1985: The analysis of synoptic scale divergence by a statistical interpolation scheme. *Mon. Wea. Rev.,* **113,** 1066–1079.

——, 1991: *Atmospheric Data Analysis.* Cambridge University Press, 457 pp.

——, 1993: Estimating observation error statistics for atmospheric data assimilation. *Ann. Geophys.,* **11,** 634–647.

DAO, 1996: Algorithm theoretical basis document version 1.01. Data Assimilation Office, NASA/Goddard Space Flight Center, Greenbelt, MD. [Available online at http://dao.gsfc.nasa.gov/subpages/atbd.html.]

Dee, D. P., 1995: On-line estimation of error covariance parameters for atmospheric data assimilation. *Mon. Wea. Rev.,* **123,** 1128–1145.

——, and A. M. da Silva, 1998: Data assimilation in the presence of forecast bias. *Quart. J. Roy. Meteor. Soc.,* **124,** 269–295.

——, G. Gaspari, C. Redder, L. Rukhovets, and A. M. da Silva, 1999: Maximum-likelihood estimation of forecast and observation error covariance parameters. Part II: Applications. *Mon. Wea. Rev.,* **127,** 1835–1849.

Devenyi, D., and T. W. Schlatter, 1994: Statistical properties of 3-hour prediction errors derived from the mesoscale analysis and prediction system. *Mon. Wea. Rev.,* **122,** 1263–1280.

Fisher, R. A., 1922: On the mathematical foundations of theoretical statistics. *Philos. Trans. Roy. Soc. London,* **222A,** 309–368.

Gandin, L. S., 1963: *Objective Analysis of Meteorological Fields* (in Russian). Israel Program for Scientific Translation, 242 pp.

Gaspari, G., and S. Cohn, 1999: Construction of correlation functions in two and three dimensions. *Quart. J. Roy. Meteor. Soc.,* **125,** 723–757.

Hollingsworth, A., and P. Lönnberg, 1986: The statistical structure of short-range forecast errors as determined from rawinsonde data. Part I: The wind field. *Tellus,* **38A,** 111–136.

Lönnberg, P., and A. Hollingsworth, 1986: The statistical structure of short-range forecast errors as determined from rawinsonde data. Part II: The covariance of height and wind errors. *Tellus,* **38A,** 137–161.

Lorenc, A. C., 1986: Analysis methods for numerical weather prediction. *Quart. J. Roy. Meteor. Soc.,* **112,** 1177–1194.

Lupton, R., 1993: *Statistics in Theory and Practice.* Princeton University Press, 188 pp.

Muirhead, R. J., 1982: *Aspects of Multivariate Statistical Theory.* Wiley, 673 pp.

Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: *Numerical Recipes in FORTRAN: The Art of Scientific Computing.* 2d ed. Cambridge University Press, 963 pp.

Riishøjgaard, L.-P., 1998: A direct way of specifying flow-dependent background error correlations for meteorological analysis systems. *Tellus,* **50A,** 42–57.

Rutherford, I., 1972: Data assimilation by statistical interpolation of forecast error fields. *J. Atmos. Sci.,* **29,** 809–815.

Sorenson, H. W., 1980: *Parameter Estimation: Principles and Problems.* Marcel Dekker, 382 pp.

Strang, G., 1988: *Linear Algebra and its Applications.* 3d ed. Academic Press, 505 pp.

Thiébaux, H. J., H. L. Mitchell, and D. W. Shantz, 1986: Horizontal structure of hemispheric forecast error correlations for geopotential and temperature. *Mon. Wea. Rev.,* **114,** 1048–1066.

——, L. L. Morone, and R. L. Wobus, 1990: Global forecast error correlation. Part 1: Isobaric wind and geopotential. *Mon. Wea. Rev.,* **118,** 2117–2137.

Wahba, G., and J. Wendelberger, 1980: Some new mathematical methods for variational objective analysis using splines and cross-validation. *Mon. Wea. Rev.,* **108,** 1122–1145.

——, D. R. Johnson, F. Gao, and J. Gong, 1995: Adaptive tuning of numerical weather prediction models: Randomized GCV in three- and four-dimensional data assimilation. *Mon. Wea. Rev.,* **123,** 3358–3369.

## APPENDIX A

### Correlation Models

Consider covariance models of the form

cov(**x**, **y**) = *σ*(**x**)*σ*(**y**)*ρ*(‖**x** − **y**‖), (A1)

with *σ*(**x**) a positive real-valued function and ‖**x** − **y**‖ the Euclidean distance between locations **x** and **y**. The *representing function ρ* must satisfy certain conditions for (A1) to be a legitimate (i.e., positive semidefinite) covariance model; see Gaspari and Cohn (1999) for details.

In this study and in Part II we consider three alternatives for the representing function *ρ.* In each case the Euclidean distance ‖**x** − **y**‖ is replaced by the so-called *horizontal distance r* = *r*(**x**, **y**), defined as the chordal distance between the orthogonal projections of the locations **x** and **y** onto the earth’s surface. Each alternative generates a legitimate covariance model on the sphere.
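As a numerical sanity check of form (A1), one can assemble a covariance matrix from a representing function on a set of points and verify that it is positive semidefinite. The choice of *σ*(**x**), of sample points, and of the power-law form below is arbitrary and illustrative.

```python
import numpy as np

# Representing function: a power law with decorrelation length L.
def rho_power(r, L):
    return 1.0 / (1.0 + r**2 / (2.0 * L**2))

# Build cov(x_i, x_j) = sigma(x_i)*sigma(x_j)*rho(||x_i - x_j||), as in (A1),
# on arbitrary 2D sample points with an arbitrary positive sigma(x).
pts = np.random.default_rng(2).uniform(0, 10, size=(25, 2))
sigma = 1.0 + 0.1 * pts[:, 0]                    # positive, illustrative
r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.outer(sigma, sigma) * rho_power(r, L=2.0)

# A legitimate covariance model must be positive semidefinite:
# all eigenvalues nonnegative (up to rounding error).
eigs = np.linalg.eigvalsh(C)
print("smallest eigenvalue:", eigs.min())
```

A representing function that is not positive semidefinite would show up here as clearly negative eigenvalues.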

The first is the *power law*:

*ρ*_{p}(*r*; *L*) = [1 + *r*^{2}/(2*L*^{2})]^{−1}, (A2)

where *L* is the decorrelation length scale defined by

*L*^{2} = −1/*ρ*″(0); (A3)

see Daley (1991, section 4.3).

The second is the *compactly supported fifth-order piecewise rational function* (Gaspari and Cohn 1999, section 4.3):

*ρ*_{c}(*r*; *L*) =
 −(1/4)(*r*/*c*)^{5} + (1/2)(*r*/*c*)^{4} + (5/8)(*r*/*c*)^{3} − (5/3)(*r*/*c*)^{2} + 1, for 0 ≤ *r* ≤ *c*;
 (1/12)(*r*/*c*)^{5} − (1/2)(*r*/*c*)^{4} + (5/8)(*r*/*c*)^{3} + (5/3)(*r*/*c*)^{2} − 5(*r*/*c*) + 4 − (2/3)(*c*/*r*), for *c* ≤ *r* ≤ 2*c*;
 0, for *r* > 2*c*,

with *L* being the decorrelation length scale defined by (A3). The function vanishes identically for *r* ≥ *r*∗ = 2*c*; applying (A3) to *ρ*_{c} gives the relation *c* = (10/3)^{1/2}*L* between the support parameter *c* and the length scale *L.*

The third is the *windowed power law*:

*ρ*_{w}(*r*; *L*_{1}, *L*_{2}) = *ρ*_{p}(*r*; *L*_{1})*ρ*_{c}(*r*; *L*_{2}), (A8)

the product of the first two functions. Since *ρ*(0) = 1 and *ρ*′(0) = 0 for each of the functions considered here, it is easy to show that

1/*L*^{2} = 1/*L*^{2}_{1} + 1/*L*^{2}_{2}. (A9)

The support of the windowed power law can be controlled by means of the parameter *L*_{2}: the function is identically zero for *r* > *r*∗ when

*L*_{2} = (3/10)^{1/2}*r*∗/2. (A10)

If we consider the decorrelation length scale *L* as the single free (tunable) parameter in (A8), one should take

*L*_{1} = (1/*L*^{2} − 1/*L*^{2}_{2})^{−1/2}, (A11)

which follows from (A9), with *L*_{2} given by (A10).
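The three representing functions can be coded directly. The sketch below follows the forms given above (the fifth-order piecewise rational function is the one of Gaspari and Cohn 1999), with the support parameter *c* related to the length scale by *c* = (10/3)^{1/2}*L*; the simple spot checks verify *ρ*(0) = 1 and the compact support.

```python
import numpy as np

def rho_power(r, L):
    # Power law with decorrelation length L, i.e. L^2 = -1/rho''(0).
    return 1.0 / (1.0 + r**2 / (2.0 * L**2))

def rho_compact(r, c):
    # Fifth-order piecewise rational function of Gaspari and Cohn (1999);
    # support parameter c, identically zero for r >= 2c.
    r = np.asarray(r, dtype=float)
    s = r / c
    inner = -0.25*s**5 + 0.5*s**4 + 0.625*s**3 - (5/3)*s**2 + 1.0
    # Guard the (2/3)*(c/r) term against division by zero at r = 0.
    outer = ((1/12)*s**5 - 0.5*s**4 + 0.625*s**3 + (5/3)*s**2
             - 5.0*s + 4.0 - (2/3)/np.where(s > 0, s, np.inf))
    return np.where(r <= c, inner, np.where(r < 2*c, outer, 0.0))

def rho_windowed(r, L1, c2):
    # Windowed power law: product of the two functions above.
    return rho_power(r, L1) * rho_compact(r, c2)

c = 2.0
L = np.sqrt(3.0 / 10.0) * c           # length scale implied by support c
print("rho_p(0) =", rho_power(0.0, L))
print("rho_c(0) =", rho_compact(0.0, c))
print("rho_c just past 2c =", rho_compact(2*c + 1e-3, c))
```

The two branches of the compactly supported function match at *r* = *c* (both equal 5/24 there), which is a convenient additional check.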

Figure A1 shows plots of the three functions for identical values of the length-scale parameter *L,* as well as their discrete Legendre spectra.

## APPENDIX B

### Multiple-Sample GCV

Wahba et al. (1995) describe a procedure for estimating the parameters *σ*_{1}, *σ*_{2}, and **θ** in (44) based on a single residual **v.** It is assumed that

⟨**v**⟩ = *μ*,  ⟨(**v** − *μ*)(**v** − *μ*)^{T}⟩ = *σ*^{2}_{1}**S**_{1} + *σ*^{2}_{2}**S**_{2}(**θ**), (B1)

with *μ*, **S**_{1}, and **S**_{2}(**θ**) known with the exception of the parameters **θ.** First, let

**A**(*λ*, **θ**) = [**I** + *λ***S**^{1/2}_{1}**S**^{−1}_{2}(**θ**)**S**^{1/2}_{1}]^{−1},  **y** = **S**^{−1/2}_{1}(**v** − *μ*),

with **S**^{1/2}_{1} the symmetric square root of **S**_{1}. Then find *λ̂*, *θ̂* by minimizing the GCV function

*V*(*λ*, **θ**) = [*n*^{−1}‖[**I** − **A**(*λ*, **θ**)]**y**‖^{2}] / {*n*^{−1} trace[**I** − **A**(*λ*, **θ**)]}^{2}.

This determines *θ̂*; the variance estimates then follow from

*σ̂*^{2}_{1} = {*n*^{−1} trace[**I** − **A**(*λ̂*, *θ̂*)]} × *V*(*λ̂*, *θ̂*),  *σ̂*^{2}_{2} = *σ̂*^{2}_{1}/*λ̂*.

In order to apply this procedure to a dataset {**v**_{k}} one can simply concatenate the **v**_{k} into a single random vector

**v** = [**v**^{T}_{1} · · · **v**^{T}_{K}]^{T},

since the **v**_{k} are independent. Suppose the mean and covariance models for the **v**_{k} are

⟨**v**_{k}⟩ = *μ*_{k},  ⟨(**v**_{k} − *μ*_{k})(**v**_{k} − *μ*_{k})^{T}⟩ = *σ*^{2}_{1}**S**_{k1} + *σ*^{2}_{2}**S**_{k2}(**θ**). (B9)

The covariance model (B1) for **v** is then block-diagonal, with blocks given by (B9). It is easily checked that

*V*(*λ*, **θ**) = [*ν*^{−1} Σ^{K}_{k=1} ‖[**I** − **A**_{k}(*λ*, **θ**)]**y**_{k}‖^{2}] / {*ν*^{−1} Σ^{K}_{k=1} trace[**I** − **A**_{k}(*λ*, **θ**)]}^{2}, (B10)

where

**A**_{k}(*λ*, **θ**) = [**I** + *λ***S**^{1/2}_{k1}**S**^{−1}_{k2}(**θ**)**S**^{1/2}_{k1}]^{−1},  **y**_{k} = **S**^{−1/2}_{k1}(**v**_{k} − *μ*_{k}),

and *ν* = Σ^{K}_{k=1} *n*_{k} as in (43).

If the model is stationary, so that **S**_{k1} = **S**_{1} and **S**_{k2} = **S**_{2} for all *k*, then the function *V*(*λ*, **θ**) simplifies as follows. For the numerator,

Σ^{K}_{k=1} ‖[**I** − **A**(*λ*, **θ**)]**y**_{k}‖^{2} = *K* trace{[**I** − **A**(*λ*, **θ**)]^{2}**S̄**},

where **S̄** = *K*^{−1} Σ^{K}_{k=1} **y**_{k}**y**^{T}_{k} is the sample covariance matrix of the whitened residuals **y**_{k}.
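The multiple-sample GCV computation can be sketched as follows for the stationary case. The matrices **S**_{1} and **S**_{2}, the sizes, and the grid scan over *λ* below are illustrative stand-ins; a real implementation would minimize *V* properly and would exploit the sample-covariance simplification rather than looping over residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 25, 40

# Illustrative stationary instance of (B9): S1 diagonal but nonconstant
# (observation error structure), S2 a fixed correlation matrix
# (forecast error structure).
x = np.linspace(0, 10, n)
S1 = np.diag(1.0 + 0.05 * x)
S2 = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)

sig1, sig2 = 1.0, 2.0                      # true values; lambda = 0.25
S = sig1**2 * S1 + sig2**2 * S2
Lc = np.linalg.cholesky(S)
residuals = [Lc @ rng.standard_normal(n) for _ in range(K)]   # mu_k = 0

# Symmetric square root of S1 (and its inverse) via the eigendecomposition.
w, Q = np.linalg.eigh(S1)
S1_half = Q @ np.diag(np.sqrt(w)) @ Q.T
S1_halfinv = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
ys = [S1_halfinv @ v for v in residuals]   # whitened residuals y_k

def influence(lam):
    # A(lam) = [I + lam * S1^{1/2} S2^{-1} S1^{1/2}]^{-1}
    return np.linalg.inv(np.eye(n) + lam * S1_half @ np.linalg.inv(S2) @ S1_half)

def gcv(lam):
    # Stationary case of the multiple-sample GCV function: A_k = A, nu = n*K.
    IA = np.eye(n) - influence(lam)
    num = sum(np.sum((IA @ y)**2) for y in ys) / (n * K)
    den = (np.trace(IA) / n)**2
    return num / den

lams = np.linspace(0.02, 1.5, 75)
Vvals = [gcv(l) for l in lams]
i = int(np.argmin(Vvals))
lam_hat = lams[i]
# Variance estimates recovered as by-products, as in the single-sample case.
sig1_sq_hat = (np.trace(np.eye(n) - influence(lam_hat)) / n) * Vvals[i]
print("lambda-hat:", lam_hat,
      "sigma1^2-hat:", sig1_sq_hat,
      "sigma2^2-hat:", sig1_sq_hat / lam_hat)
```

With many residual vectors the minimizer of the scanned GCV function lands near the true variance ratio, and the recovered variances are of the right magnitude.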