## Abstract

Estimating the mean and the covariance matrix of an incomplete dataset and filling in missing values with imputed values is generally a nonlinear problem, which must be solved iteratively. The expectation maximization (EM) algorithm for Gaussian data, an iterative method both for the estimation of mean values and covariance matrices from incomplete datasets and for the imputation of missing values, is taken as the point of departure for the development of a regularized EM algorithm. In contrast to the conventional EM algorithm, the regularized EM algorithm is applicable to sets of climate data, in which the number of variables typically exceeds the sample size. The regularized EM algorithm is based on iterated analyses of linear regressions of variables with missing values on variables with available values, with regression coefficients estimated by ridge regression, a regularized regression method in which a continuous regularization parameter controls the filtering of the noise in the data. The regularization parameter is determined by generalized cross-validation, so as to minimize, approximately, the expected mean-squared error of the imputed values. The regularized EM algorithm can estimate, and exploit for the imputation of missing values, both synchronic and diachronic covariance matrices, which may contain information on spatial covariability, stationary temporal covariability, or cyclostationary temporal covariability. A test of the regularized EM algorithm with simulated surface temperature data demonstrates that the algorithm is applicable to typical sets of climate data and that it leads to more accurate estimates of the missing values than a conventional noniterative imputation technique.

## 1. Introduction

Because the availability of climatic measurements varies spatially and temporally, sets of climate data are usually incomplete. This circumstance complicates multivariate analyses of climate data. Even the estimation of mean values and covariance matrices, the fundamental statistics from which every multivariate analysis issues, becomes difficult when a dataset is incomplete. For example, mean values and covariance matrices of the earth’s surface temperature are needed to assess whether climate models simulate the spatial and temporal temperature variability adequately. If a complete dataset of surface temperatures were available, the sample mean and the sample covariance matrix of the dataset—provided that measurement errors are negligible—would represent consistent and unbiased estimators of the mean and covariance matrix of the surface temperature in the region and period encompassed by the dataset. But if only an incomplete dataset of surface temperatures is available, a direct estimation of the mean and of the covariance matrix from the available data usually is not admissible. For instance, the sample mean of the available data can be an inaccurate estimate of the mean of the data. And if one were to estimate a covariance matrix from all data available in an incomplete dataset, leaving out in the sample covariance matrix the terms involving missing values, the estimated covariance matrix would not necessarily be positive semidefinite and might have negative eigenvalues. But a covariance matrix estimate with negative eigenvalues might lead to erratic results in analyses that, like the principal component analysis, rest upon eigendecompositions of covariance matrices. Moreover, projections of multivariate data onto subspaces—projections onto the empirical orthogonal functions (EOFs) of a principal component analysis, for example—are not well-defined when values of variables are missing.
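
The hazard of such pairwise deletion can be made concrete with a small numerical sketch (Python; the three-variable toy dataset and all numbers are invented for illustration). Estimating each covariance only from the records in which both variables are available yields a symmetric matrix with a negative eigenvalue:

```python
import numpy as np

# Toy incomplete dataset: 3 variables, 9 records; each pair of variables
# is observed together only on a disjoint subset of records (NaN = missing).
nan = np.nan
X = np.array([
    [1.0, 1.0, nan],
    [2.0, 2.0, nan],
    [3.0, 3.0, nan],   # x1 and x2 vary together (+)
    [1.0, nan, 3.0],
    [2.0, nan, 2.0],
    [3.0, nan, 1.0],   # x1 and x3 vary oppositely (-)
    [nan, 1.0, 1.0],
    [nan, 2.0, 2.0],
    [nan, 3.0, 3.0],   # x2 and x3 vary together (+)
])

def pairwise_cov(X):
    """Covariance by pairwise deletion: each entry uses only the records
    in which both variables are available (means taken over that subset)."""
    p = X.shape[1]
    C = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            xi, xj = X[ok, i], X[ok, j]
            C[i, j] = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (ok.sum() - 1)
    return C

C = pairwise_cov(X)
print(np.linalg.eigvalsh(C).min())  # negative: C is not positive semidefinite
```

A principal component analysis of such a matrix would assign negative variance to some EOFs, which is the erratic behavior alluded to above.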

One might circumvent the difficulties that incomplete data cause in a multivariate analysis by excluding from the analysis all variables for which values are missing. For example, if the dataset under consideration contains yearly records of monthly mean surface temperatures, with each variable of the dataset representing the temperature at one point of a global grid, one could exclude from the analysis all variables, or all grid points, for which at least one temperature value is missing. Having thus restricted the analysis to a complete subset of the data, one could estimate mean values and covariance matrices as sample means and sample covariance matrices, and projections of the reduced dataset onto subspaces would be well-defined. However, excluding from the analysis variables for which only a few values are missing would use the available information inefficiently. Methods are therefore needed for estimating mean values and covariance matrices reliably from all information available in an incomplete dataset. Since low-dimensional projections of multivariate data, for example, in the form of spatial averages or principal components, play an important role in the analysis of climate data, methods for the estimation of mean values and covariance matrices should also fill in missing values with plausible imputed values, such that projections of the completed dataset onto subspaces are good approximations of the corresponding projections of the unknown complete dataset.

The estimation of mean values and covariance matrices from incomplete data and the imputation of missing values are closely linked problems. When an estimate of the mean and a positive definite estimate of the covariance matrix of a dataset are available, the missing values in the dataset can be filled in with their conditional expectation values given the available values in the dataset (Buck 1960). Conversely, the mean and the covariance matrix can be estimated from a completed dataset with imputed values filled in for missing values, provided that an estimate of the covariance matrix of the error of the imputed values is also available. The covariance matrix of the error of the imputed values is required because the sample covariance matrix of the completed dataset underestimates the variances and covariances of the data if, as is the case when imputing conditional expectation values, the imputed values come exclusively from the center of the conditional distribution of the missing values given the available values. The expected variances and covariances of the deviations of the missing values from the imputed values, or the expected variances and covariances of the imputation error, must be taken into account in estimating the covariance matrix of the data (Little and Rubin 1987, chapter 3.4).

Since estimates of the mean and of the covariance matrix of an incomplete dataset depend on the unknown missing values, and since, conversely, estimates of the missing values depend on the unknown statistics of the data, estimating the mean and the covariance matrix of an incomplete dataset and imputing missing values generally is a nonlinear problem, which must be solved iteratively. In what follows, an iterative method both for the estimation of mean values and covariance matrices and for the imputation of missing values will be presented. The expectation maximization (EM) algorithm (Dempster et al. 1977) is taken as the point of departure for the development of a regularized EM algorithm that is applicable to incomplete sets of climate data, in which the number of variables typically exceeds the number of records. What will result is an algebraic framework within which some conventional techniques for the imputation of missing values in climate data—for example, the techniques described by Smith et al. (1996) and by Mann et al. (1998)—can be interpreted as approximations to regularized EM algorithms.

With the EM algorithm, the maximum likelihood estimates of the parameters of any probability distribution can be computed from incomplete data [see Little and Rubin (1987, chapter 7) for a review]. For Gaussian data, whose probability distribution can be parameterized by the mean and the covariance matrix, the EM algorithm starts with initial guesses for the mean and the covariance matrix and then cycles through the alternating steps of imputing missing values and re-estimating the mean and the covariance matrix from the completed dataset and from an estimate of the covariance matrix of the imputation error. In the imputation step, the missing values of the dataset are filled in with their conditional expectation values given the available values, and the covariance matrix of the error of the thus imputed values is estimated. The expectations both of the missing values and of the covariance matrix of the imputation error are computed from the estimates of the mean and of the covariance matrix and hence are conditional expectations given these estimates. In the estimation step, the mean and the covariance matrix are re-estimated, whereby the contribution of the conditional imputation error to the covariance matrix is taken into account. The imputation step and the estimation step are iterated until the imputed values and the estimated mean and covariance matrix stop changing from one iteration to the next.

In order to make the EM algorithm applicable to typical sets of climate data, it must be modified. In the EM algorithm for Gaussian data, the conditional expectations of the missing values and of the covariance matrix of the imputation error follow, for each record with missing values, from an analysis of the linear regression of the variables with missing values on the variables with available values, which means that the EM algorithm for Gaussian data is based on iterated linear regression analyses (see, e.g., Little and Rubin 1987, chapter 8). Yet typical sets of climate data, containing thousands of variables but at most a few hundred records from which the statistics of the data can be estimated, are rank-deficient, so that the parameters of the regression models in the EM algorithm, and thus the conditional expectations of the missing values given the available values, are underdetermined. Such underdetermined, or ill-posed, problems can be solved with regularization methods, which impose additional constraints on the solution to render it unique, for example, by requiring smoothness of the completed dataset with imputed values filled in for the missing values [see Hansen (1997) for a survey of regularization methods]. To make the EM algorithm applicable to sets of climate data, the ill-posed problem of estimating regression models from rank-deficient data must be regularized with some such method.

Ill-posed problems in climate research are often regularized by performing multivariate analyses in a truncated principal component basis (see, e.g., Smith et al. 1996; Kaplan et al. 1997; Mann et al. 1998). If a problem is regularized by truncating a principal component analysis, high-frequency or small-scale components of the solution, represented by higher-order principal components, are filtered out. The truncation parameter, specifying the degree of regularization, or, for spatial data, the degree of smoothness, is a discrete regularization parameter, which is often determined with ad hoc techniques (see, e.g., Kaplan et al. 1997).

In the regularized EM algorithm that will be presented, the regression parameters are not estimated in a truncated principal component basis, but regularized regression parameters are computed with a method known to statisticians as ridge regression and to applied mathematicians as Tikhonov regularization (Hoerl and Kennard 1970a,b; Tikhonov and Arsenin 1977). In ridge regression, a continuous regularization parameter controls the degree of regularization imposed on the regression coefficients. High-frequency or small-scale components in the regression coefficients are filtered out not by truncating a principal component analysis, but by gradually damping the amplitudes of higher-order principal components (Hansen 1997, chapter 4.2). In the regularized EM algorithm, the regularization parameter that controls the filtering is determined by generalized cross-validation (Golub et al. 1979), in such a way as to minimize, approximately, the expected mean-squared error of the imputed values. With simulated surface temperature data, it will be demonstrated that the resulting algorithm leads to more accurate estimates of the missing values than a noniterative imputation technique (Smith et al. 1996) that is based on a truncated principal component analysis.
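
For orientation, generalized cross-validation for ridge regression can be sketched as follows (Python; a generic single-response version with an invented test problem, not the multivariate criterion developed later in the paper). The GCV function compares the ridge residual with the effective number of parameters, both of which are cheap to evaluate for many values of *h* once a singular value decomposition is available:

```python
import numpy as np

def gcv_ridge(Xa, y, hs):
    """Pick the ridge parameter h minimizing the GCV function
        GCV(h) = n^-1 ||(I - A(h)) y||^2 / [n^-1 tr(I - A(h))]^2,
    where A(h) = Xa (Xa^T Xa + n h^2 I)^-1 Xa^T is the influence matrix.
    Computed via the SVD of Xa, so each trial h costs only O(n)."""
    n = Xa.shape[0]
    U, s, _ = np.linalg.svd(Xa, full_matrices=False)
    Uty = U.T @ y
    best = None
    for h in hs:
        f = s**2 / (s**2 + n * h**2)          # filter factors in (0, 1]
        resid2 = np.sum(((1 - f) * Uty)**2) + (y @ y - Uty @ Uty)
        denom = (n - np.sum(f)) / n           # n^-1 tr(I - A(h))
        gcv = resid2 / n / denom**2
        if best is None or gcv < best[1]:
            best = (h, gcv)
    return best[0]

# invented test problem: sparse coefficients plus noise
rng = np.random.default_rng(0)
Xa = rng.standard_normal((50, 10))
beta = np.zeros(10); beta[:3] = 1.0
y = Xa @ beta + 0.3 * rng.standard_normal(50)
h = gcv_ridge(Xa, y, np.logspace(-3, 1, 60))
print(h)
```

The filter factors *f* show the gradual damping described above: components with singular values well above *n*^{1/2}*h* pass nearly unchanged, while smaller ones are smoothly suppressed rather than truncated.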

In section 2, a review of the EM algorithm for Gaussian data of full rank introduces the concepts and the notation used throughout this paper. Section 3 takes the EM algorithm for Gaussian data of full rank as the point of departure to develop a regularized variant of the EM algorithm in which ridge regression with generalized cross-validation replaces the potentially ill-posed maximum likelihood estimation of the regression parameters in the conventional EM algorithm. This regularized EM algorithm is applicable to rank-deficient data. The discussion in sections 2 and 3 is abstract; no particular spatial or temporal interpretation is assigned to the variables in a dataset. The variables may represent spatial or temporal or mixed spatiotemporal data. In section 4, it is shown how, by arranging the variables in a dataset into records in different ways, the regularized EM algorithm can estimate, and exploit for the imputation of missing values, spatial covariance matrices and mixed spatiotemporal covariance matrices. In section 5, the regularized EM algorithm is compared with conventional techniques for the imputation of missing values in climate data. Section 6 describes results of a test of the regularized EM algorithm with simulated surface temperature data in which values were deleted in a manner characteristic of observational data. Section 7 summarizes the conclusions and discusses potential applications of the regularized EM algorithm, for example, to the construction of historic surface temperature datasets. The appendix contains a note on estimating the imputation error.

## 2. The expectation maximization algorithm

With the EM algorithm, the parameters of a probability distribution are estimated from incomplete data by maximizing iteratively the likelihood of the available data, the likelihood of the available data being viewed as a function of the parameters (Dempster et al. 1977). The EM algorithm, like all methods for incomplete data that ignore the mechanism causing the gaps in the dataset, rests on the assumption that the missing values in the dataset are missing at random, in the sense that the probability that a value is missing does not depend on the missing value (Rubin 1976). For example, in a dataset with monthly mean surface temperatures on a spatial grid, the missing values are missing at random if correlations between anthropogenic temperature changes and the availability of data are negligible, for then the availability of a temperature measurement usually does not depend on the temperature to be measured. As a contrasting example, the availability of in situ measurements of the windspeeds in hurricanes does depend on the windspeed to be measured, so it would not be justified to assume that the missing values are missing at random. The EM algorithm and the methods that will be derived from it in subsequent sections are only applicable to datasets in which the missing values are missing at random.

The probability distribution of multivariate Gaussian data can be parameterized by the mean and the covariance matrix (i.e., the mean and the covariance matrix are sufficient statistics of the Gaussian distribution). In an iteration of the EM algorithm for Gaussian data, estimates of the mean and of the covariance matrix are revised in three steps. First, for each record with missing values, the regression parameters of the variables with missing values on the variables with available values are computed from the estimates of the mean and of the covariance matrix. Second, the missing values in a record are filled in with their conditional expectation values given the available values and the estimates of the mean and of the covariance matrix, the conditional expectation values being the product of the available values and the estimated regression coefficients. Third, the mean and the covariance matrix are re-estimated, the mean as the sample mean of the completed dataset and the covariance matrix as the sum of the sample covariance matrix of the completed dataset and the contributions of the conditional covariance matrices of the imputation errors in the records with imputed values (see, e.g., Little and Rubin 1987, chapter 8). The EM algorithm starts with initial estimates of the mean and of the covariance matrix and cycles through these steps until the imputed values and the estimates of the mean and of the covariance matrix stop changing appreciably from one iteration to the next.

For the following formal description of the EM algorithm, let **X** ∈ ℜ^{n×p} be a data matrix with *n* records consisting of *p* variables, with the values of some of the variables missing in some records. The *p* variables might represent a geophysical field at *p* different locations, and the *n* records might represent incomplete measurements of the *p* variables at *n* different times. In the conventional EM algorithm, the number *n* of records is assumed to be much greater than the number *p* of variables, so that the sample covariance matrix of the dataset completed with imputed values is positive definite.

From the incomplete dataset, the mean *μ* ∈ ℜ^{1×p} of the records and the covariance matrix **Σ** ∈ ℜ^{p×p} of the variables are to be estimated. For a given record **x** = **X**_{i:} with missing values,^{1} let the vector **x**_{a} ∈ ℜ^{1×pa} consist of the *p*_{a} variables for which, in the given record, the values are available, and let the vector **x**_{m} ∈ ℜ^{1×pm} consist of the remaining *p*_{m} variables for which, in the given record, the values are missing. Let the mean *μ* be partitioned correspondingly into a part *μ*_{a} ∈ ℜ^{1×pa} with the mean values of the variables for which, in the given record, the values are available, and a part *μ*_{m} ∈ ℜ^{1×pm} with the mean values of the variables for which, in the given record, the values are missing. For each record **x** = **X**_{i:} (*i* = 1, . . . , *n*) with missing values, the relationship between the variables with missing values and the variables with available values is modeled by a linear regression model

$$\mathbf{x}_m = \boldsymbol{\mu}_m + (\mathbf{x}_a - \boldsymbol{\mu}_a)\,\mathbf{B} + \mathbf{e}. \tag{1}$$

The matrix **B** ∈ ℜ^{pa×pm} is a matrix of regression coefficients, and the residual **e** ∈ ℜ^{1×pm} is assumed to be a random vector with mean zero and unknown covariance matrix **C** ∈ ℜ^{pm×pm}. In each iteration of the EM algorithm, estimates of the mean *μ* and of the covariance matrix **Σ** are taken as given, and from these estimates, the conditional maximum likelihood estimates of the matrix of regression coefficients **B** and of the covariance matrix **C** of the residual are computed for each record with missing values. With the estimated regression model for each record, the missing values are then filled in with imputed values, and new estimates of the mean *μ* and of the covariance matrix **Σ** are computed from the completed dataset and from the estimates of the residual covariance matrices **C**.^{2}
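
The partitioning of a record and of the estimated mean and covariance matrix into "available" and "missing" parts amounts to simple index bookkeeping; a minimal Python sketch (with invented numbers standing in for a record and the current estimates) might look as follows:

```python
import numpy as np

# One record x with NaNs marking missing values; mu and Sigma are the
# current estimates (illustrative values, not from any real dataset).
x = np.array([0.3, np.nan, -1.2, np.nan])
mu = np.zeros(4)
Sigma = np.eye(4)

a = ~np.isnan(x)                  # availability mask for this record
m = np.isnan(x)                   # missingness mask
x_a, mu_a, mu_m = x[a], mu[a], mu[m]
Sigma_aa = Sigma[np.ix_(a, a)]    # variances/covariances of available variables
Sigma_am = Sigma[np.ix_(a, m)]    # cross-covariances available vs. missing
Sigma_mm = Sigma[np.ix_(m, m)]    # variances/covariances of missing variables
print(x_a, Sigma_aa.shape, Sigma_am.shape, Sigma_mm.shape)
```

Because the pattern of missing values differs from record to record, these submatrices must be re-extracted for each record in every iteration.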

Let *μ̂*^{(t)} and **Σ̂**^{(t)} denote the estimates of the mean and of the covariance matrix in the *t*th iteration of the EM algorithm. (The hat accent **Â** designates an estimate of a quantity **A**.) The estimates of the mean and of the covariance matrix are either the result of the preceding EM iteration or, in the first EM iteration, they may be the sample mean and the sample covariance matrix of the dataset with initial guesses filled in for the missing values. For a given record **x** = **X**_{i:} with missing values, let the covariance matrix estimate **Σ̂**^{(t)} be partitioned corresponding to the partitioning of the given record into variables with available values and variables with missing values: let the submatrix **Σ̂**_{aa} of the estimated covariance matrix **Σ̂**^{(t)} consist of the estimated variances and covariances of the variables for which, in the given record, the values are available; let the submatrix **Σ̂**_{mm} consist of the estimated variances and covariances of the variables for which, in the given record, the values are missing; and let the two submatrices **Σ̂**_{am} and **Σ̂**_{ma} with **Σ̂**_{am} = **Σ̂**^{T}_{ma} consist of the estimated cross-covariances of the variables for which, in the given record, the values are available with the variables for which, in the given record, the values are missing.^{3} Given the partitioned estimate of the covariance matrix **Σ̂**^{(t)}, the conditional maximum likelihood estimate of the regression coefficients can be written as

$$\hat{\mathbf{B}} = \hat{\boldsymbol{\Sigma}}_{aa}^{-1}\, \hat{\boldsymbol{\Sigma}}_{am} \tag{2}$$

(cf. Mardia et al. 1979, chapter 6.2). From the structure of the regression model (1), it follows that, given an estimate **B̂** of the regression coefficients and the partitioned estimate of the covariance matrix **Σ̂**^{(t)}, an estimate of the residual covariance matrix takes the generic form

$$\hat{\mathbf{C}} = \hat{\boldsymbol{\Sigma}}_{mm} - \hat{\boldsymbol{\Sigma}}_{ma}\hat{\mathbf{B}} - \hat{\mathbf{B}}^T\hat{\boldsymbol{\Sigma}}_{am} + \hat{\mathbf{B}}^T\hat{\boldsymbol{\Sigma}}_{aa}\hat{\mathbf{B}}. \tag{3}$$

Upon substitution of the conditional maximum likelihood estimate (2) of the regression coefficients, the conditional maximum likelihood estimate of the residual covariance matrix turns out to be the Schur complement

$$\hat{\mathbf{C}} = \hat{\boldsymbol{\Sigma}}_{mm} - \hat{\boldsymbol{\Sigma}}_{ma}\,\hat{\boldsymbol{\Sigma}}_{aa}^{-1}\,\hat{\boldsymbol{\Sigma}}_{am}$$

of the submatrix **Σ̂**_{aa} in the covariance matrix estimate **Σ̂**^{(t)} (cf. Mardia et al. 1979, chapter 6.2). As a Schur complement of a positive definite matrix **Σ̂**^{(t)}, the residual covariance matrix **Ĉ** is assured to be positive definite (Horn and Johnson 1985, p. 472).
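
This positive definiteness is easy to confirm numerically; the following Python sketch (with a randomly generated positive definite matrix standing in for **Σ̂**^{(t)}) computes the Schur complement and its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
Sigma = A @ A.T + 6 * np.eye(6)       # a positive definite "covariance" estimate
a, m = slice(0, 4), slice(4, 6)       # first 4 variables available, last 2 missing

Saa, Sam = Sigma[a, a], Sigma[a, m]
Sma, Smm = Sigma[m, a], Sigma[m, m]
B = np.linalg.solve(Saa, Sam)         # conditional ML regression coefficients, eq. (2)
C = Smm - Sma @ B                     # Schur complement of Saa in Sigma
print(np.linalg.eigvalsh(C))          # all eigenvalues positive
```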

The conditional expectation value **x̂**_{m} ≡ *E*(**x**_{m}|**x**_{a}; *μ̂*^{(t)}, **Σ̂**^{(t)}) of the missing values in a given record follows from the estimated regression coefficients **B̂** and the available values **x**_{a} as

$$\hat{\mathbf{x}}_m = \hat{\boldsymbol{\mu}}_m + (\mathbf{x}_a - \hat{\boldsymbol{\mu}}_a)\,\hat{\mathbf{B}},$$

where the vector *μ̂*_{a} is that part of the mean estimate *μ̂*^{(t)} that belongs to the variables for which, in the given record, the values are available, and the vector *μ̂*_{m} is that part of the mean estimate *μ̂*^{(t)} that belongs to the variables for which, in the given record, the values are missing.

After the missing values in all records **x** = **X**_{i:} (*i* = 1, . . . , *n*) have thus been filled in with imputed values **x̂**_{m}, the sample mean

$$\hat{\boldsymbol{\mu}}^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{X}_{i:}$$

of the completed dataset is a new estimate of the mean of the records. A new estimate of the covariance matrix follows from the conditional expectation of the cross-products **Ŝ**^{(t)}_{i} ≡ *E*[**X**^{T}_{i:}**X**_{i:}|**x**_{a}; *μ̂*^{(t)}, **Σ̂**^{(t)}] as

$$\hat{\boldsymbol{\Sigma}}^{(t+1)} = \frac{1}{\tilde{n}}\sum_{i=1}^{n} \left[ \hat{\mathbf{S}}^{(t)}_i - \left(\hat{\boldsymbol{\mu}}^{(t+1)}\right)^T \hat{\boldsymbol{\mu}}^{(t+1)} \right], \tag{6}$$

where, for each record **x** = **X**_{i:}, the conditional expectation **Ŝ**^{(t)}_{i} of the cross-products is composed of three parts. The two parts that involve the available values in the record,

$$\mathbf{x}_a^T \mathbf{x}_a \qquad \text{and} \qquad \mathbf{x}_a^T \hat{\mathbf{x}}_m,$$

are sample cross-products of values in the completed record. The part that involves exclusively the imputed values in the record,

$$\hat{\mathbf{x}}_m^T \hat{\mathbf{x}}_m + \hat{\mathbf{C}},$$

is the sum of the cross-product of the imputed values and the residual covariance matrix **Ĉ** = Cov(**x**_{m}, **x**_{m}|**x**_{a};*μ̂*^{(t)}, **Σ̂**^{(t)}), the conditional covariance matrix of the imputation error. The normalization constant *ñ* of the covariance matrix estimate (6) is the number of degrees of freedom of the sample covariance matrix of the completed dataset. If, as above, one mean vector *μ* is estimated, the number of degrees of freedom is *ñ* = *n* − 1. If, as will be described in section 4c, *S* mean vectors of *S* groups of records (for example, *S* seasonal mean vectors) are estimated, the number of degrees of freedom is *ñ* = *n* − *S.* The covariance matrix (6) is computed with the factor 1/*ñ* in place of the factor 1/*n* with which a maximum likelihood estimate would be computed, in order to correct the bias of the maximum likelihood estimate in a manner that parallels the bias-correction in the case of a complete dataset (cf. Beale and Little 1975). Thus, the new estimate (6) of the covariance matrix is computed in the same way as the sample covariance matrix of the completed dataset, except that, for each record with missing values, the estimated residual covariance matrix **Ĉ** is added to the cross-products **x̂**^{T}_{m}**x̂**_{m} of the imputed values (cf. Little and Rubin 1987, chapter 8).

The next iteration of the EM algorithm is carried out with the updated estimates *μ̂*^{(t+1)} and **Σ̂**^{(t+1)} of the mean and of the covariance matrix. The iterations are stopped when the algorithm has converged, that is, when the estimates *μ̂*^{(t)} and **Σ̂**^{(t)} and the imputed values **x̂**_{m} stop changing appreciably. The EM algorithm converges monotonically in that the likelihood of the available data increases monotonically from iteration to iteration. However, the EM algorithm converges only linearly, with a rate of convergence that depends on the fraction of values that are missing in the dataset, and so it may need many iterations to converge [see Little and Rubin (1987, chapters 7 and 8) for a more rigorous derivation and properties of the EM algorithm].
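
The complete iteration of this section can be condensed into a short program. The following Python sketch implements the unregularized EM algorithm for Gaussian data under the assumption *n* ≫ *p*; all data are synthetic, and records missing every value are excluded from the toy example for simplicity:

```python
import numpy as np

def em_gaussian(X, max_iter=200, tol=1e-6):
    """EM for an incomplete Gaussian dataset X (NaN = missing), n >> p.
    Returns (mu, Sigma, Xc): mean, covariance estimate, completed data.
    Follows the steps of section 2; no regularization, so the submatrices
    Sigma_aa must stay well conditioned."""
    X = np.asarray(X, float)
    n, p = X.shape
    miss = np.isnan(X)
    Xc = X.copy()
    # initial guesses: fill missing entries with column means
    col_mean = np.nanmean(X, axis=0)
    Xc[miss] = np.take(col_mean, np.nonzero(miss)[1])
    mu = Xc.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    for _ in range(max_iter):
        Xold = Xc.copy()
        Csum = np.zeros((p, p))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            a = ~m
            Saa = Sigma[np.ix_(a, a)]
            Sam = Sigma[np.ix_(a, m)]
            B = np.linalg.solve(Saa, Sam)                  # regression step, eq. (2)
            Xc[i, m] = mu[m] + (Xc[i, a] - mu[a]) @ B      # imputation step
            C = Sigma[np.ix_(m, m)] - Sam.T @ B            # residual (Schur) covariance
            Cfull = np.zeros((p, p))
            Cfull[np.ix_(m, m)] = C
            Csum += Cfull
        mu = Xc.mean(axis=0)                               # estimation step
        dev = Xc - mu
        Sigma = (dev.T @ dev + Csum) / (n - 1)             # add imputation-error cov
        if np.abs(Xc - Xold).max() < tol:
            break
    return mu, Sigma, Xc

# synthetic check: correlated Gaussian data with ~20% of values deleted at random
rng = np.random.default_rng(2)
L = np.array([[1.0, 0.0], [0.9, 0.44]])
X = rng.standard_normal((500, 2)) @ L.T
Xobs = X.copy()
Xobs[rng.random(X.shape) < 0.2] = np.nan
both = np.isnan(Xobs).all(axis=1)     # keep the toy example simple:
Xobs[both] = X[both]                  # no record loses all of its values
mu, Sigma, Xc = em_gaussian(Xobs)
print(mu, Sigma)
```

The estimated covariance matrix stays close to the true one (here [[1, 0.9], [0.9, 1.0036]]), whereas the sample covariance of the completed dataset without the `Csum` correction would underestimate the variances, as noted above.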

If, for any record, the number *p*_{a} of variables with available values is greater than the number *ñ* of degrees of freedom available for the estimation of the covariance matrix, the submatrix **Σ̂**_{aa} of the covariance matrix estimate **Σ̂**^{(t)} is singular and the conditional maximum likelihood estimate (2) of the matrix of regression coefficients **B** is not defined. The submatrix **Σ̂**_{aa} of the covariance matrix estimate may already be poorly conditioned if the number *ñ* of degrees of freedom only marginally exceeds the number *p*_{a} of available values in a record. In such ill-posed or ill-conditioned cases, it is necessary to regularize the estimate (2) of the regression coefficients.

## 3. The regularized EM algorithm

The regularized EM algorithm consists of the same steps as the EM algorithm, with the exception that, in each iteration and for each record with missing values, the inverse matrix **Σ̂**^{−1}_{aa} in the estimate (2) of the regression coefficients is replaced with a regularized inverse

$$\left( \hat{\boldsymbol{\Sigma}}_{aa} + h^2 \hat{\mathbf{D}} \right)^{-1},$$

where **D̂** = Diag(**Σ̂**_{aa}) is the diagonal matrix consisting of the diagonal elements of the covariance matrix **Σ̂**_{aa} and the scalar *h* is a regularization parameter. That is, the ill-defined or ill-conditioned inverse **Σ̂**^{−1}_{aa} is replaced with the inverse of the matrix that results from the covariance matrix **Σ̂**_{aa} when the diagonal elements are inflated by the factor 1 + *h*^{2}. This method of regularizing the inverse of a matrix, in which a regularized inverse is formed as the inverse of the sum of the matrix and a multiple of a positive definite matrix, is called ridge regression in the statistics literature and Tikhonov regularization in the literature on numerical linear algebra [Hoerl and Kennard (1970a,b); Tikhonov and Arsenin (1977); see Hansen (1997, chapter 5) for a review and Tarantola (1987, chapter 1) for a Bayesian justification]. In statistics, the regularization parameter *h* is known as the ridge parameter.
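
In code, the only change relative to the conventional estimate (2) is the inflated diagonal; a minimal Python sketch (with an invented rank-deficient example) is:

```python
import numpy as np

def ridge_regression_coefficients(Saa, Sam, h):
    """Regularized analogue of eq. (2): Sigma_aa^-1 is replaced by the
    inverse of Sigma_aa with its diagonal inflated by the factor 1 + h^2."""
    D = np.diag(np.diag(Saa))               # D = Diag(Sigma_aa)
    return np.linalg.solve(Saa + h**2 * D, Sam)

# even a singular Saa yields well-defined coefficients once h > 0
Saa = np.array([[1.0, 1.0], [1.0, 1.0]])    # rank one, hence singular
Sam = np.array([[0.5], [0.5]])
Bh = ridge_regression_coefficients(Saa, Sam, 0.5)
print(Bh)   # both coefficients equal 0.5 / 2.25 = 2/9
```

As *h* → 0, the coefficients approach the (here undefined) unregularized estimate; as *h* grows, they are damped toward zero.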

First, we will develop a representation of the regularized estimates of the regression parameters that makes some properties of ridge regression manifest and leads to a procedure for computing the regression parameters in the regularized EM algorithm. Second, we will describe a criterion for the choice of the regularization parameter *h.* Third, we will juxtapose two variants of ridge regression, both of which can be used in the regularized EM algorithm. A more detailed discussion of the methods presented below can be found in the referenced literature.

### a. Ridge regression

In terms of the correlation matrix

$$\hat{\boldsymbol{\Sigma}}'_{aa} = \hat{\mathbf{D}}^{-1/2}\, \hat{\boldsymbol{\Sigma}}_{aa}\, \hat{\mathbf{D}}^{-1/2}$$

and the scaled cross-covariance matrix

$$\hat{\boldsymbol{\Sigma}}'_{am} = \hat{\mathbf{D}}^{-1/2}\, \hat{\boldsymbol{\Sigma}}_{am},$$

the regularized estimate of the regression coefficients can be written as

$$\hat{\mathbf{B}}_h = \hat{\mathbf{D}}^{-1/2}\, \hat{\mathbf{B}}'_h,$$

where

$$\hat{\mathbf{B}}'_h = \left( \hat{\boldsymbol{\Sigma}}'_{aa} + h^2 \mathbf{I} \right)^{-1} \hat{\boldsymbol{\Sigma}}'_{am}$$

is termed the standard form of the estimate (cf. Hansen 1997, chapter 2.3). The fact that the correlation matrix **Σ̂**^{′}_{aa} and the scaled cross-covariance matrix **Σ̂**^{′}_{am} can be factored in similar ways can be exploited to cast the problem of estimating the regression coefficients from scaled submatrices **Σ̂**^{′}_{aa} and **Σ̂**^{′}_{am} of a given covariance matrix estimate **Σ̂**^{(t)} into the more conventional form of estimating regression coefficients from given data matrices. This recasting of the estimation problem will lead to a representation of the regularized regression coefficients that makes some properties of ridge regression manifest and translates into a procedure for computing the regression coefficients in the regularized EM algorithm.
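
That the standard form is merely a change of coordinates can be checked numerically. The following Python sketch (random example; the back-transformation **B̂**_{h} = **D̂**^{−1/2}**B̂**^{′}_{h} is implied by the scaling **B**′ = **D̂**^{1/2}**B**) confirms that the standard-form estimate, transformed back, coincides with ridge regression applied directly to the unscaled submatrices:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)           # positive definite covariance estimate
Saa, Sam = Sigma[:3, :3], Sigma[:3, 3:]
h = 0.7

d = np.sqrt(np.diag(Saa))
Saa_p = Saa / np.outer(d, d)              # correlation matrix Sigma'_aa
Sam_p = Sam / d[:, None]                  # scaled cross-covariances Sigma'_am

# standard form, then transform back with D^{-1/2}
Bp = np.linalg.solve(Saa_p + h**2 * np.eye(3), Sam_p)
B1 = Bp / d[:, None]
# direct form: diagonal of Sigma_aa inflated by the factor 1 + h^2
B2 = np.linalg.solve(Saa + h**2 * np.diag(np.diag(Saa)), Sam)
print(np.allclose(B1, B2))                # True
```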

The correlation matrix **Σ̂**^{′}_{aa}, the scaled cross-covariance matrix **Σ̂**^{′}_{am}, and the submatrix **Σ̂**_{mm} of the covariance matrix estimate **Σ̂**^{(t)} can be decomposed into factors **X**_{a} ∈ ℜ^{ñ×pa} and **X**_{m} ∈ ℜ^{ñ×pm}, such that

$$\hat{\boldsymbol{\Sigma}}'_{aa} = \frac{\mathbf{X}_a^T \mathbf{X}_a}{\tilde{n}}, \qquad \hat{\boldsymbol{\Sigma}}'_{am} = \frac{\mathbf{X}_a^T \mathbf{X}_m}{\tilde{n}}, \qquad \hat{\boldsymbol{\Sigma}}_{mm} = \frac{\mathbf{X}_m^T \mathbf{X}_m}{\tilde{n}}. \tag{13}$$

The factors **X**_{a} and **X**_{m} can be viewed as analogues of data matrices whose second-moment matrices **X**^{T}_{a}**X**_{a}/*ñ*, **X**^{T}_{a}**X**_{m}/*ñ*, and **X**^{T}_{m}**X**_{m}/*ñ* are the scaled submatrices (13) of the covariance matrix estimate **Σ̂**^{(t)}.

The sampling error of the covariance matrix estimate **Σ̂**^{(t)} contributes to the error of the imputed values and hence will play a role in determining the regularization parameter *h* (see section 3b and the appendix). Let us assume that the sampling error of the covariance matrix estimate is equal to the sampling error that would be expected if the dataset were complete and if the covariance matrix estimate **Σ̂**^{(t)} were the sample covariance matrix. The distribution of the sampling error of a sample covariance matrix is a function of the number *ñ* of degrees of freedom available for the estimation of the covariance matrix (see, e.g., Mardia et al. 1979, chapter 3.4), and so, in order for the assumed sampling error of the scaled submatrices (13) to be equal to the sampling error that would be expected for second moment matrices of actual data matrices **X**_{a} and **X**_{m}, it is necessary that the number of rows of the factors **X**_{a} and **X**_{m} be equal to the number *ñ* of degrees of freedom. That is, the factors **X**_{a} and **X**_{m} must have *ñ* = *n* − 1 rows if one mean vector *μ* is estimated from the dataset **X** and *ñ* = *n* − *S* rows if *S* mean vectors of *S* groups of records are estimated (see section 4c).

The factorization (13) of the scaled submatrices can, for instance, be obtained from an eigendecomposition **Σ̂**^{(t)} = **T****Φ**^{2}**T**^{T} of the covariance matrix estimate, with a matrix **T** ∈ ℜ^{p×ñ} containing as its columns the mutually orthogonal eigenvectors of the covariance matrix estimate **Σ̂**^{(t)} and with a diagonal matrix **Φ**^{2} = Diag(*ϕ*^{2}_{j}) of eigenvalues *ϕ*^{2}_{j} (*j* = 1, . . . , *ñ*). Let the submatrix **T**_{a} ∈ ℜ^{pa×ñ} consist of those rows of the eigenvector matrix **T** that belong to the variables for which, in the record under consideration, the values are available, and let the submatrix **T**_{m} ∈ ℜ^{pm×ñ} consist of the remaining rows of the eigenvector matrix **T** that belong to the variables for which, in the record under consideration, the values are missing. In terms of the partitioned eigendecomposition of the covariance matrix estimate **Σ̂**^{(t)}, the factors **X**_{a} and **X**_{m} can be written as

$$\mathbf{X}_a = \tilde{n}^{1/2}\, \boldsymbol{\Phi}\, \mathbf{T}_a^T\, \hat{\mathbf{D}}^{-1/2}, \qquad \mathbf{X}_m = \tilde{n}^{1/2}\, \boldsymbol{\Phi}\, \mathbf{T}_m^T, \tag{14}$$

which shows that a factorization of the form (13) exists. If the number *p* of variables is greater than or equal to the number *ñ* of degrees of freedom available for the estimation of the covariance matrix, the number *ñ* of degrees of freedom is just the number of nonzero eigenvalues *ϕ*^{2}_{j} of the covariance matrix estimate **Σ̂**^{(t)}. If the number *p* of variables is less than the number *ñ* of degrees of freedom, the number of nonzero eigenvalues *ϕ*^{2}_{j} is less than the number *ñ* of degrees of freedom. In this latter case, the factors **X**_{a} and **X**_{m} could be a product of the above form (14), provided that the matrix **Φ** with the square roots of the eigenvalues *ϕ*^{2}_{j} is completed with zeros to have *ñ* rows. However, the form of the factors is irrelevant for the present argument. What is relevant is that a factorization (13) of the scaled submatrices of the covariance matrix estimate **Σ̂**^{(t)} exists.

The factors **X**_{a} and **X**_{m} can be interpreted as the data matrices in the linear regression model
**X**_{m} = **X**_{a}**B**′ + **E**,   (15)

where **E** ∈ ℜ^{ñ×pm} is a matrix of residuals. From the factorization (13) of the scaled submatrices **Σ̂**^{′}_{aa} and **Σ̂**^{′}_{am} of the covariance matrix estimate **Σ̂**^{(t)}, it follows that estimating the regression coefficients **B**′ of the regression model (15) from given data matrices **X**_{a} and **X**_{m} is equivalent to estimating the standard form **B**′ = **D̂**^{1/2}**B** of the regression coefficients of the model (1) from a given covariance matrix estimate **Σ̂**^{(t)}. The standard form **B̂**^{′}_{h} = (**Σ̂**^{′}_{aa} + *h*^{2}**I**)^{−1}**Σ̂**^{′}_{am} of the regularized regression coefficients expressed in terms of the submatrices of the covariance matrix estimate **Σ̂**^{(t)} is identical to the standard form **B̂**^{′}_{h} = (**X**^{T}_{a}**X**_{a} + *ñh*^{2}**I**)^{−1}**X**^{T}_{a}**X**_{m} of the regularized regression coefficients expressed in terms of the data matrices **X**_{a} and **X**_{m}. Moreover, for any estimate **B̂**′ of the standard form regression coefficients **B**′, the second-moment matrix **Ê**^{T}**Ê**/*ñ* of the estimated residuals
**Ê** = **X**_{m} − **X**_{a}**B̂**′

is identical to the generic estimate (3) of the residual covariance matrix **C** of the regression model (1). Hence, estimating the regression coefficients and the residual second-moment matrix of the regression model (15) from given data matrices **X**_{a} and **X**_{m} is equivalent to estimating the regression coefficients and the residual covariance matrix of the regression model (1) from a given covariance matrix estimate **Σ̂**^{(t)}. Since, under the above assumptions on the sampling error of the covariance matrix estimate **Σ̂**^{(t)}, the expected sampling errors of the estimated parameters also coincide, estimating the parameters of the regression model (1) from a given estimate **Σ̂**^{(t)} of the covariance matrix is equivalent to estimating the parameters of the regression model (15) from given data matrices **X**_{a} and **X**_{m}. This equivalence makes it possible to apply standard methods for the regularization of conventional regression models (15) to the regression model (1) figuring in the EM algorithm.

A revealing representation of the ridge regression coefficients results from a singular value decomposition of the matrix **X**_{a} (cf. Hansen 1997, chapter 5). Let us rescale the factors **X̃**_{a} = **X**_{a}/*ñ*^{1/2} and **X̃**_{m} = **X**_{m}/*ñ*^{1/2} such that in the factorization of the correlation matrix **Σ̂**^{′}_{aa} = **X̃**^{T}_{a}**X̃**_{a} and of the scaled cross-covariance matrix **Σ̂**^{′}_{am} = **X̃**^{T}_{a}**X̃**_{m} the number *ñ* of degrees of freedom no longer appears explicitly. Whatever form is ascribed to the rescaled factor **X̃**_{a}, it has a singular value decomposition **X̃**_{a} = **U****Λ****V**^{T}, where **U** and **V** are orthogonal matrices and **Λ** = Diag(*λ*_{j}) is the diagonal matrix of singular values *λ*_{j}. In the basis of the singular value decomposition, the correlation matrix becomes **Σ̂**^{′}_{aa} = **V****Λ**^{2}**V**^{T}, which implies that the squared singular values *λ*^{2}_{j} are the eigenvalues of the correlation matrix **Σ̂**^{′}_{aa} and that the right singular vectors **V**_{:j}, the columns of the matrix **V**, are the corresponding eigenvectors (see, e.g., Golub and van Loan 1993, chapter 2.5). Substituting the factorization (13) and the singular value decomposition of the rescaled factor **X̃**_{a} into the standard form estimate (12) yields the representation
**B̂**^{′}_{h} = **V****Λ**(**Λ**^{2} + *h*^{2}**I**)^{−1}**F**   (16)

of the regression coefficients. The elements of the matrix **F** ≡ **U**^{T}**X̃**_{m} are called Fourier coefficients, in analogy to inverse problems in which the counterpart of the matrix **X̃**_{a} is a convolution operator whose singular value decomposition is equivalent to a Fourier expansion (cf. Wahba 1977; Anderssen and Prenter 1981).

The representation (16) of the regression coefficients shows that, in the standard form, the columns of the regression coefficient matrix **B̂**^{′}_{h} are linear combinations of the eigenvectors **V**_{:j} of the correlation matrix **Σ̂**^{′}_{aa}. Only the eigenvectors **V**_{:j} belonging to nonzero eigenvalues *λ*^{2}_{j} contribute to the regression coefficients. The weights of the eigenvectors **V**_{:j} are given by the products of the scalars *λ*_{j}/(*λ*^{2}_{j} + *h*^{2}) and the Fourier coefficients **F**_{j:}, which implies that only those rows **F**_{j:} of the Fourier coefficient matrix that belong to nonzero eigenvalues *λ*^{2}_{j} contribute to the regression coefficients.

The Fourier coefficients can be expressed in terms of the scaled cross-covariance matrix **Σ̂**^{′}_{am} and of the nonzero eigenvalues and corresponding eigenvectors of the correlation matrix **Σ̂**^{′}_{aa}. Since, in terms of the singular value decomposition of the rescaled factor **X̃**_{a}, the scaled cross-covariance matrix **Σ̂**^{′}_{am} = **X̃**^{T}_{a}**X̃**_{m} can be written as **Σ̂**^{′}_{am} = (**V****Λ****U**^{T})**X̃**_{m} = **V****Λ****F**, we can take
**F** = **Λ**^{+}**V**^{T}**Σ̂**^{′}_{am}   (17)

as the matrix of Fourier coefficients, the diagonal matrix **Λ**^{+} = Diag(*λ*^{+}_{j}) being the pseudoinverse of the singular value matrix **Λ**; that is, the diagonal elements of the pseudoinverse **Λ**^{+} are *λ*^{+}_{j} = 1/*λ*_{j} if *λ*_{j} > 0 and *λ*^{+}_{j} = 0 if *λ*_{j} = 0. [In actual computations, an element *λ*^{+}_{j} of the pseudoinverse should be set to zero if the singular value *λ*_{j} is smaller than a threshold value ɛ that depends on the machine precision; see, e.g., Golub and van Loan (1993, chapter 5).] If the *j*th eigenvalue *λ*^{2}_{j} of the correlation matrix **Σ̂**^{′}_{aa} is zero, the *j*th row **F**_{j:} of the Fourier coefficient matrix (17) consists of zeros and might thus differ from the *j*th row of the matrix **U**^{T}**X̃**_{m} that was originally defined to be the matrix of Fourier coefficients. But since all other rows of these matrices—the rows belonging to nonzero eigenvalues *λ*^{2}_{j}—agree, the differences in the rows belonging to zero eigenvalues do not affect the estimate (16) of the regression coefficients.

Thus, we can compute the regression coefficients **B̂**^{′}_{h} from the partitioned covariance matrix estimate **Σ̂**^{(t)} as a product (16) that involves the nonzero eigenvalues and corresponding eigenvectors of the correlation matrix **Σ̂**^{′}_{aa} and the Fourier coefficients (17). If there are *ñ* degrees of freedom for the estimation of the covariance matrix **Σ** and *p*_{a} available values in the record for which the regression parameters are estimated, the number *r* of nonzero eigenvalues of the correlation matrix is at most *ñ* or *p*_{a}, whichever is smaller. Henceforth, we let the eigenvalue matrix **Λ**^{2} ∈ ℜ^{r×r} and the eigenvector matrix **V** ∈ ℜ^{pa×r} contain only the *r* nonzero eigenvalues and corresponding eigenvectors, and we similarly restrict the Fourier coefficient matrix **F** ∈ ℜ^{r×pm} to the *r* relevant rows. The expression (16) for the standard form estimate of the regression coefficients remains valid with these restricted matrices.
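The following sketch verifies this computation numerically: the standard-form ridge coefficients obtained directly from hypothetical data matrices **X**_{a} and **X**_{m} coincide with the product assembled from the eigendecomposition of the second-moment matrix and the Fourier coefficients, in the manner of (16) and (17). Sizes and data are illustrative, and the diagonal scaling is omitted, which does not affect the algebraic identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tilde, pa, pm = 20, 6, 3                     # hypothetical sizes
Xa = rng.standard_normal((n_tilde, pa))        # available-variable data matrix
Xm = rng.standard_normal((n_tilde, pm))        # missing-variable data matrix
h = 0.5                                        # regularization parameter

# Direct standard-form ridge estimate: (Xa'Xa + n_tilde h^2 I)^-1 Xa'Xm
B_direct = np.linalg.solve(Xa.T @ Xa + n_tilde * h**2 * np.eye(pa), Xa.T @ Xm)

# The same estimate via the eigendecomposition of S_aa = Xa'Xa/n_tilde and the
# Fourier coefficients F = Lambda^+ V' S_am, in the manner of (16) and (17)
S_aa = Xa.T @ Xa / n_tilde
S_am = Xa.T @ Xm / n_tilde
lam2, V = np.linalg.eigh(S_aa)
keep = lam2 > 1e-12 * lam2.max()               # restrict to nonzero eigenvalues
lam2, V = lam2[keep], V[:, keep]
lam = np.sqrt(lam2)
F = (V.T @ S_am) / lam[:, None]                # Fourier coefficient matrix
B_eig = V @ ((lam / (lam2 + h**2))[:, None] * F)

assert np.allclose(B_direct, B_eig)            # the two routes agree
```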

The covariance matrix of the residuals, which, in updating the covariance matrix estimate at the end of each EM iteration, is added to the cross-products (9) of the completed data matrix, can also be represented in a factored form. Substituting the estimate **B̂**_{h} = **D̂**^{−1/2}**B̂**^{′}_{h} of the regression coefficients into the generic expression (3) for the residual covariance matrix yields the estimate
**Ĉ**_{h} = **Σ̂**_{mm} − **F**^{T}**F** + **F**^{T}**Γ**^{2}**F**,  where **Γ** ≡ *h*^{2}(**Λ**^{2} + *h*^{2}**I**)^{−1}.   (18)

The term
**Σ̂**_{mm} − **F**^{T}**F**,

which is independent of the regularization parameter *h,* vanishes if the regression coefficients are not overdetermined, which is the case if the number *ñ* of degrees of freedom for the estimation of the covariance matrix **Σ** is less than or equal to the number *p*_{a} of variables with available values. Since the residual covariance matrix depends on the regularization method and on the regularization parameter, both of which cannot usually be chosen a priori, without reference to the dataset under consideration, the residual covariance matrix is not, as in the well-posed case, the conditional covariance matrix of the imputation error. The uncertainties about the adequacy of the regularization method and the regularization parameter contribute to the conditional imputation error given the estimates of the mean and of the covariance matrix, but the residual covariance matrix does not account for these uncertainties. Nevertheless, substituting the residual covariance matrix (18) for the conditional covariance matrix of the imputation error in updating the covariance matrix estimate at the end of each EM iteration seems a plausible heuristic.

The representation (16) makes manifest the way in which ridge regression regularizes the regression coefficients, that is, the way in which the noise, the high-frequency or small-scale components of the data, is filtered out. Both the regression coefficients regularized by a truncated principal component analysis of the correlation matrix Σ̂^{′}_{aa} and the regression coefficients regularized by ridge regression can be written as
**B̂**′ = Σ^{r}_{j=1} *f*_{j}*λ*^{−1}_{j}**V**_{:j}**F**_{j:},

where what are called the filter factors *f*_{j} depend on the regularization method (Hansen 1997, chapter 4.2). For principal component regression, the filter factors *f*_{j} of the retained principal component vectors (EOFs) **V**_{:j} are unity, and the filter factors of the discarded principal component vectors are zero. Regularization by a truncated principal component analysis of the correlation matrix **Σ̂**^{′}_{aa}, which is what applied mathematicians call regularization by truncated singular value decomposition (cf. Schneider and Griffies 1999; Hansen 1997, chapter 3), corresponds to filtering with a step function filter. For ridge regression, the filter factors are
*f*_{j} = *λ*^{2}_{j}/(*λ*^{2}_{j} + *h*^{2}).   (19)

This filter function is structurally identical to the Wiener filter. The eigenvalues *λ*^{2}_{j} are the correlate of the spectral density of what is called the signal in Wiener filtering, and the squared regularization parameter *h*^{2} is the correlate of the spectral density of what is called the noise in Wiener filtering (Papoulis 1991, chapter 14.1; Anderssen and Prenter 1981). The filter function of ridge regression decays more slowly with decreasing eigenvalues *λ*^{2}_{j} than the step function filter of principal component regression. Principal component vectors with eigenvalues *λ*^{2}_{j} much greater than the squared regularization parameter *h*^{2} are unaffected by the filtering. Principal component vectors with eigenvalues *λ*^{2}_{j} much smaller than the squared regularization parameter *h*^{2} are effectively filtered out (Hansen 1997, chapter 4.2).
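The contrast between the two filters can be sketched for an illustrative eigenvalue spectrum (the eigenvalues, the regularization parameter, and the truncation parameter below are hypothetical):

```python
import numpy as np

# Filter factors for ridge regression vs. principal component regression,
# for an illustrative eigenvalue spectrum (values are hypothetical).
lam2 = np.array([4.0, 1.0, 0.25, 0.0625, 0.015625])   # eigenvalues lambda_j^2
h2 = 0.25                                             # squared regularization parameter

f_ridge = lam2 / (lam2 + h2)       # smooth Wiener-type filter, cf. (19)
k = 2                              # truncation parameter of the PCA variant
f_pca = (np.arange(lam2.size) < k).astype(float)      # step function filter

# lam2 >> h2 gives f close to 1 (retained); lam2 << h2 gives f close to 0
print(f_ridge.round(3), f_pca)
```

Note how the ridge filter decays smoothly through the middle of the spectrum, where the step function filter must cut abruptly at the truncation index.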

For typical climate data, which do not have an evident gap in the eigenvalue spectrum and whose samples are so small that only a few principal components can be retained in a truncated principal component analysis, leaving only a small choice of possible truncation parameters, the smoother filtering afforded by ridge regression and the greater flexibility of a continuous regularization parameter could offer advantages over principal component regression. The structural parallels between the ridge regression filter and the optimal Wiener filter moreover suggest that ridge regression might suppress noise in the data in a more robust way and with less loss of relevant information than principal component regression (cf. Anderssen and Prenter 1981). Indeed, ridge regression also arises as a regularization method when the observational error in the available data, which is ignored in the regression model (1), is explicitly taken into account (Golub et al. 2000). In the regression model (1), the available values **x**_{a} in a record are taken as known and observational errors are neglected, but ridge regression in the form presented here is still an adequate regularization method if the available values are affected by a non-negligible observational error whose relative variance—the variance of the observational error relative to the variance of the observed variable—is homogeneous throughout the dataset. By choosing a regularized inverse (10) with a different matrix **D̂**, one that, in contrast to the matrix **D̂** above, does not consist of the diagonal elements of the covariance matrix **Σ̂**_{aa}, other variance structures of the observational error can be accommodated in ridge regression (cf. Golub et al. 2000). 
To be sure, observational errors are also taken into account in a regularization method known as truncated total least squares, in which regression coefficients are computed in a truncated basis of principal components of the overall covariance matrix **Σ̂**^{(t)} instead of the scaled submatrix **Σ̂**^{′}_{aa} (Fierro et al. 1997). But the continuous regularization parameter of ridge regression might still offer advantages over a truncated principal component analysis when there is only a small choice of possible truncation parameters.

### b. Generalized cross-validation

In the regularized EM algorithm, the estimated regression coefficients are not of interest in their own right but only as intermediaries in the imputation of missing values. As a criterion for the choice of the regularization parameter *h,* it is therefore suitable to require that the error of the imputed values be as small as possible. As the regularization parameter tends to zero, the imputed values are increasingly affected by noise, implying an increasing sampling error. Conversely, as the regularization parameter tends to infinity, the ridge regression coefficients tend to zero and the imputed values (4) tend to the estimated mean values, implying an increasing regularization error (cf. Hansen 1997, chapter 7). A good choice of regularization parameter, in between the limiting cases of zero and infinity, should minimize the total imputation error, the sum of the sampling error and the regularization error.

Golub et al. (1979) argued that the regularization parameter *h* that minimizes the expected mean-squared error of predictions with an estimated linear regression model (15) is approximately equal to the minimizer of the generalized cross-validation (GCV) function
GCV(*h*) = [*ñ*^{−1}‖**X**_{a}**B̂**^{′}_{h} − **X**_{m}‖^{2}_{F}] / [*ñ*^{−1} tr(**I** − **X**_{a}**X**^{†}_{a})]^{2},   (20)

an objective function that resembles the objective function of ordinary cross-validation but is, in contrast to the latter, invariant under orthogonal transformations of the data [see Wahba (1990, chapter 4) and Hansen (1997, chapter 7.4) for reviews]. The notations ‖**A**‖_{F} and tr **A** indicate the Frobenius norm and the trace of a matrix **A**,^{4} and the matrix
**X**^{†}_{a} ≡ (**X**^{T}_{a}**X**_{a} + *ñh*^{2}**I**)^{−1}**X**^{T}_{a}

in the denominator of the GCV function is the regularized pseudoinverse of the data matrix **X**_{a}. With the regularized pseudoinverse **X**^{†}_{a} of the data matrix **X**_{a}, the regularized regression coefficients of the model (15) can be written as **B̂**^{′}_{h} = **X**^{†}_{a}**X**_{m}, which, if the data matrices **X**_{a} and **X**_{m} are again regarded as the factors in the decomposition (13) of the correlation matrix **Σ̂**^{′}_{aa} and of the scaled cross-covariance matrix **Σ̂**^{′}_{am}, is identical to the standard form (12) of the regularized regression coefficients. Under the assumptions of section 3a on the sampling error of the correlation matrix **Σ̂**^{′}_{aa} and of the scaled cross-covariance matrix **Σ̂**^{′}_{am}, the sampling error of the regularized regression coefficients is equal to the sampling error that would be expected if the regularized regression coefficients were estimated from actual data matrices **X**_{a} and **X**_{m} with *ñ* records, so that the regularization parameter *h* that minimizes the expected mean-squared error of the imputed values is likewise approximately equal to the minimizer of the GCV function (20). Therefore, in each iteration of the regularized EM algorithm, the regularization parameter *h* for each record with missing values is chosen as the minimizer of the GCV function (20).

An alternative form of the GCV function follows from the eigendecomposition of the correlation matrix **Σ̂**^{′}_{aa} and the derived representations of the regression coefficients and of the residual covariance matrix. Since the squared Frobenius norm of a matrix is equal to the trace of the product of the matrix and its transpose, ‖**A**‖^{2}_{F} = tr(**A**^{T}**A**), the squared Frobenius norm ‖**X**_{a}**B̂**^{′}_{h} − **X**_{m}‖^{2}_{F} = ‖**Ê**‖^{2}_{F} in the numerator of the GCV function is proportional to the trace of the residual covariance matrix **Ĉ**_{h} = **Ê**^{T}**Ê**/*ñ.* Hence, the GCV function can be written as
GCV(*h*) = tr **Ĉ**_{h} / [𝒯(*h*)/*ñ*]^{2},

where
𝒯(*h*) = tr(**I** − **X**_{a}**X**^{†}_{a}),

an effective number of degrees of freedom for the estimation of the residual covariance matrix **Ĉ**_{h} (Wahba 1990, chapter 4), can be expressed in terms of the filter factors (19) as
𝒯(*h*) = *ñ* − Σ^{r}_{j=1} *f*_{j}

(Hansen 1997, chapter 7.2). For a given regularization parameter *h,* evaluating both the trace tr **Ĉ**_{h} of the residual covariance matrix from the factored representation (18) and the effective number of degrees of freedom 𝒯(*h*) from the filter factors (19) requires *O*(*r*) operations, where *r* is the number of nonzero eigenvalues of the correlation matrix **Σ̂**^{′}_{aa} (cf. Hansen 1994, 1997, chapter 4.6). That is, if the ridge regression is computed via an eigendecomposition of the correlation matrix **Σ̂**^{′}_{aa}, only a small additional effort is required to find, with one of the common scalar optimization methods, the regularization parameter *h* that minimizes the GCV function.
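A minimal sketch of this procedure follows: the GCV function is evaluated on a grid of regularization parameters, with each evaluation costing *O*(*r*) operations once the eigendecomposition and the squared row norms of the Fourier coefficient matrix are in hand. The function name, the grid search (in place of a scalar optimizer), and the simulated data are illustrative, not from the paper.

```python
import numpy as np

def gcv_ridge(Xa, Xm, n_tilde, hs):
    """Evaluate the GCV function on a grid hs of regularization parameters,
    using the eigendecomposition so each evaluation is cheap (a sketch)."""
    S_aa = Xa.T @ Xa / n_tilde
    S_am = Xa.T @ Xm / n_tilde
    S_mm = Xm.T @ Xm / n_tilde
    lam2, V = np.linalg.eigh(S_aa)
    keep = lam2 > 1e-12 * lam2.max()           # nonzero eigenvalues only
    lam2, V = lam2[keep], V[:, keep]
    F = (V.T @ S_am) / np.sqrt(lam2)[:, None]  # Fourier coefficients
    row2 = np.sum(F**2, axis=1)                # squared row norms of F
    gcv = np.empty(hs.size)
    for i, h in enumerate(hs):
        f = lam2 / (lam2 + h**2)               # filter factors (19)
        gamma2 = (1.0 - f)**2
        trC = np.trace(S_mm) - np.sum((1.0 - gamma2) * row2)  # tr C_h in O(r)
        T_eff = n_tilde - np.sum(f)            # effective degrees of freedom
        gcv[i] = trC / (T_eff / n_tilde)**2
    return gcv

rng = np.random.default_rng(2)
n_tilde, pa, pm = 30, 10, 2
Xa = rng.standard_normal((n_tilde, pa))
Xm = Xa @ rng.standard_normal((pa, pm)) + 0.3 * rng.standard_normal((n_tilde, pm))
hs = np.logspace(-3, 2, 200)
h_opt = hs[np.argmin(gcv_ridge(Xa, Xm, n_tilde, hs))]  # GCV-chosen parameter
```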

With the regularization parameter determined by generalized cross-validation, the regularized estimates of the imputed values are usually reliable, even when the noise in the data, which might be a result of observational errors, is not Gaussian and has an inhomogeneous variance structure (see, e.g., Wahba 1990, chapter 4.9). Since with small but nonzero probability the GCV function has a minimum near zero, generalized cross-validation occasionally leads to a regularization parameter near zero when, in fact, a greater regularization parameter would be more appropriate (Wahba and Wang 1995). Choosing too small a regularization parameter in such cases can be avoided by constructing a lower bound for the regularization parameter from a priori guesses of the magnitude of the imputation error (see, e.g., Hansen 1997, chapters 7.7 and 7.2).

### c. Multiple and individual ridge regressions

If ridge regression with generalized cross-validation is used in the regularized EM algorithm as described above, the regularization of the regression coefficients is controlled by one regularization parameter per record with missing values. For each record, the regression coefficients of all variables with missing values are estimated jointly by multiple ridge regression. With generalized cross-validation, the regularization parameter is chosen such as to minimize, approximately, the expected mean-squared error of the imputed values.

However, with the above methods, it is also possible to estimate individually regularized regression coefficients for each missing value. The matrix of regression coefficients (11) can be computed columnwise with an individual regularization parameter for each column. Instead of only one regularization parameter per record in multiple ridge regressions, in individual ridge regressions we can, for a record with *p*_{m} missing values, adjust *p*_{m} regularization parameters. Choosing the regularization parameter for each column of the regression coefficient matrix by generalized cross-validation approximately minimizes not only the expected average error of the imputed values in the record, but also the expected error of each individual imputed value.

The computation of individual ridge regressions is similar to the computation of a multiple ridge regression. If the ridge regression is computed via an eigendecomposition of the correlation matrix **Σ̂**^{′}_{aa}, one obtains the standard form estimate (16) of the regression coefficients columnwise from the columns of the Fourier coefficient matrix (17), with an individual regularization parameter *h*_{j} (*j* = 1, . . . , *p*_{m}) for each column. The regularization parameters *h*_{j} are determined as the minimizers of the GCV function (20), where the numerator of the GCV function reduces from the squared Frobenius norm of a residual matrix to the squared Euclidean norm of a residual vector. Generalizing the factored representation (18) of the residual covariance matrix from the case of a multiple ridge regression to that of individual ridge regressions, one finds that the residual covariance matrix of individual ridge regressions consists of the elements
(**Ĉ**_{h})_{jk} = (**Σ̂**_{mm})_{jk} − **F**^{T}_{:j}**F**_{:k} + **F**^{T}_{:j}**Γ**^{(j)}**Γ**^{(k)}**F**_{:k},

where **Γ**^{(j)} ≡ *h*^{2}_{j}(**Λ**^{2} + *h*^{2}_{j}**I**)^{−1} is a diagonal matrix and *h*_{j} is the regularization parameter for the *j*th column of the matrix of regression coefficients. In a regularized EM algorithm with individual ridge regressions, this residual covariance matrix is added to the cross-products **x̂**^{T}_{m}**x̂**_{m} of the imputed values when a new estimate of the covariance matrix (6) is assembled.

Thus, the additional computational expense of individual ridge regressions in place of a multiple ridge regression is merely that which is required to minimize the GCV function *p*_{m} times for *p*_{m} residual vectors, compared with minimizing it once for one residual matrix with *p*_{m} column vectors. As long as the greater number of regularization parameters to be estimated does not lead to the estimated regularization parameters becoming unreliable, the greater accuracy of the imputed values that can be expected with individual ridge regressions suggests the use of individual ridge regressions in the regularized EM algorithm whenever computationally feasible.
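A sketch of individual ridge regressions follows, with one GCV-chosen regularization parameter per column of the missing-variable data matrix. The function name, the grid search, and the simulated data are illustrative; the per-column GCV numerator is the diagonal element of the residual covariance matrix.

```python
import numpy as np

def gcv_columns(Xa, Xm, n_tilde, hs):
    """Choose one ridge parameter per column of Xm by minimizing a
    per-column GCV function on a grid hs (an illustrative sketch)."""
    lam2, V = np.linalg.eigh(Xa.T @ Xa / n_tilde)
    keep = lam2 > 1e-12 * lam2.max()
    lam2, V = lam2[keep], V[:, keep]
    F = (V.T @ (Xa.T @ Xm / n_tilde)) / np.sqrt(lam2)[:, None]
    smm = np.sum(Xm**2, axis=0) / n_tilde      # diagonal of the mm second moment
    h_opt = np.empty(Xm.shape[1])
    for j in range(Xm.shape[1]):               # one GCV minimization per column
        best = np.inf
        for h in hs:
            f = lam2 / (lam2 + h**2)           # filter factors (19)
            res = smm[j] - np.sum((2*f - f**2) * F[:, j]**2)  # jth residual variance
            g = res / ((n_tilde - np.sum(f)) / n_tilde)**2    # per-column GCV value
            if g < best:
                best, h_opt[j] = g, h
    return h_opt

rng = np.random.default_rng(4)
Xa = rng.standard_normal((25, 8))
Xm = Xa @ rng.standard_normal((8, 3)) + 0.2 * rng.standard_normal((25, 3))
h_per_column = gcv_columns(Xa, Xm, 25, np.logspace(-3, 1, 100))
```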

## 4. Exploiting spatial and temporal covariability

By arranging a dataset into data matrices **X** in different ways, one can, with the regularized EM algorithm, exploit spatial and temporal covariability for the imputation of missing values. Suppose the dataset under consideration consists of time series of *m* geophysical variables. The time series of each of the *m* variables would, if the dataset were complete, span *N* instants (*N* years, for example), but the values of some of the variables are missing for some instants. The *m* variables could, for example, represent yearly mean surface temperatures at *m* stations or at *m* grid points. Or some of the *m* variables could represent local surface temperature measurements and others proxies of surface temperatures, such as dendroclimatic data or isotope fractions in ice cores, so that the regularized EM algorithm exploits the covariability of the measurements and the proxy data to impute missing measurement values given available proxy data. Let the *m*-dimensional row vector **y**_{ν} contain the values of the *m* variables at instant *ν* (*ν* = 1, . . . , *N*). Whether the regularized EM algorithm estimates, and exploits for the imputation of missing values, the spatial covariance matrix or a mixed spatiotemporal covariance matrix depends on how the *N* data vectors **y**_{ν} ∈ ℜ^{1×m} are arranged into the data matrix **X** ∈ ℜ^{n×p}.

### a. Spatial covariability

The fundamental kind of covariability that the regularized EM algorithm exploits for the imputation of missing values is spatial covariability, a synchronic covariability of the variables at a single instant *ν.* If the data vectors **y**_{ν} form the rows of the data matrix
**X** = (**y**^{T}_{1}, **y**^{T}_{2}, . . . , **y**^{T}_{N})^{T}

with *n* = *N* and *p* = *m,* the regularized EM algorithm calculates an estimate of the temporal mean *μ* of the data vectors **y**_{ν} and an estimate of the spatial covariance matrix **Σ** = Cov(**y**_{ν}, **y**_{ν}). The missing values for an instant *ν* are filled in with imputed values that are conditioned on the available values for the same instant *ν* and on the estimates of the mean and of the spatial covariance matrix.

### b. Stationary temporal covariability

In addition to the synchronic spatial covariability at a single instant *ν,* the variables of the dataset may also possess diachronic covariability across different instants *ν.* This diachronic covariability can be exploited to improve the accuracy of the imputed values and of the estimates of the temporal mean and of the spatial covariance matrix of the data. If the instants for which the data are available are equally spaced in time and if the temporal variability of the data is stationary, the diachronic covariability across three neighboring instants can be taken into account in the regularized EM algorithm by arranging the data vectors **y**_{ν} for three successive instants *ν* into the rows of the data matrix
**X** = ⎡ (**y**_{1}, **y**_{2}, **y**_{3}); (**y**_{2}, **y**_{3}, **y**_{4}); . . . ; (**y**_{N−2}, **y**_{N−1}, **y**_{N}) ⎤   (24)

with *n* = *N* − 2 and *p* = 3*m.* With such a data matrix, the regularized EM algorithm calculates an estimate of the spatiotemporal covariance matrix
**Σ** = ⎡ **Σ**_{0}  **Σ**_{1}  **Σ**_{2} ; **Σ**_{−1}  **Σ**_{0}  **Σ**_{1} ; **Σ**_{−2}  **Σ**_{−1}  **Σ**_{0} ⎤,   (25)

a block Toeplitz matrix^{5} composed of the spatial covariance matrix **Σ**_{0} = Cov(**y**_{ν}, **y**_{ν}), of the lag-1 covariance matrices **Σ**_{−1} = Cov(**y**_{ν}, **y**_{ν−1}) and **Σ**_{1} = **Σ**^{T}_{−1}, and of the lag-2 covariance matrices **Σ**_{−2} = Cov(**y**_{ν}, **y**_{ν−2}) and **Σ**_{2} = **Σ**^{T}_{−2}.
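The arrangement of the records into the lagged data matrix (24) can be sketched as follows; the helper name and the toy data are illustrative.

```python
import numpy as np

def lagged_data_matrix(Y):
    """Arrange the N records y_nu (rows of Y) into the data matrix with
    n = N - 2 rows (y_nu, y_nu+1, y_nu+2) and p = 3m columns, cf. (24)."""
    N = Y.shape[0]
    return np.hstack([Y[:N-2], Y[1:N-1], Y[2:]])

Y = np.arange(12.0).reshape(6, 2)   # N = 6 records of m = 2 variables
X = lagged_data_matrix(Y)           # n = N - 2 = 4 rows, p = 3m = 6 columns
```

Each data vector **y**_{ν} in the interior of the dataset appears three times in this matrix, once in each *m*-column block, which is why the imputed values need to be computed only once per vector and then copied.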

In each iteration of the regularized EM algorithm, the imputed values and the residual covariance matrix need to be computed only once for each data vector **y**_{ν} (*ν* = 1, . . . , *N*) and can then be used to update both the copies of the vector **y**_{ν} in the data matrix (24) and the corresponding elements of the covariance matrix estimate **Σ̂**. The imputed values in a data vector **y**_{ν} are conditioned on the available values in the vector **y**_{ν} itself and on the available values in two other data vectors for neighboring instants. Whether the imputed values for an instant *ν* are conditioned on the available values for the two preceding instants, the two succeeding instants, or one preceding and one succeeding instant depends on the position, within the data matrix, of the vector **y**_{ν} for which the imputed values are computed. For example, the imputed values of a vector **y**_{ν} in the first *m*-column block of the data matrix (24) are conditioned on the available values in the vector **y**_{ν} itself and on the available values in the two succeeding vectors **y**_{ν+1} and **y**_{ν+2}. Usually, one wants the imputed values for an instant to be conditioned on the available values for the instant itself and on the available values for the two instants nearest to it. Thus, in addition to computing the imputed values for the data vectors **y**_{1} and **y**_{N} at the ends of the dataset, which occur only once in the data matrix (24), one should compute the imputed values for the data vectors **y**_{2}, . . . , **y**_{N−1} in the central *m*-column block of the data matrix. The remaining elements of the data matrix and the estimate of the covariance matrix (25) should then be updated by copying the imputed values and the residual covariance matrices from the data vectors **y**_{2}, . . . , **y**_{N−1} in the central *m*-column block of the data matrix.

The resulting regularized EM algorithm calculates an estimate of the temporal mean of the data vectors **y**_{ν} and an estimate of the spatiotemporal covariance matrix (25). The ridge regression filter (19) acts not on the EOFs, the eigenvectors of the estimated spatial correlation matrix, but on the eigenvectors of the estimated spatiotemporal correlation matrix that belongs to the spatiotemporal covariance matrix (25). The missing values for an instant *ν* are filled in with imputed values that are conditioned on the available values for the same instant *ν,* on the available values for the two instants nearest to *ν,* and on the estimates of the mean and of the spatiotemporal covariance matrix.

It is possible to modify the above way of exploiting spatiotemporal covariability in the regularized EM algorithm. For example, higher-order temporal covariability can be taken into account by increasing the data matrix (24) further, that is, by arranging more data vectors **y**_{ν} into a row of the data matrix. The algorithm presented here is the simplest algorithm that exploits temporal covariability in a symmetric manner: in the imputation of missing values for a given instant in the interior of the dataset, the available data immediately preceding and immediately succeeding that instant are taken into account in a manner that is invariant under time reversal.

### c. Cyclostationary temporal covariability

If the dataset under consideration contains, for example, monthly mean temperatures for all months of several years, the temporal variability of the data is not stationary but cyclostationary. To take cyclostationary temporal covariability into account, it often suffices to allow for a periodically varying mean of the data. Suppose each data vector **y**_{ν} (*ν* = 1, . . . , *N*) belongs to one of *S* regimes *s* (*s* = 1, . . . , *S*). The regimes could be, for example, the months of the year or the seasons. The regularized EM algorithm allows for a regime-dependent mean if, in each iteration and for each record with missing values, the parts *μ̂*_{a} and *μ̂*_{m} of the mean vector in the estimated regression (4) are assembled not from an overall mean (5) of the records, but from regime-dependent mean values
*μ̂*^{(s)} = (1/*N*_{s}) Σ_{ν ∈ regime *s*} **y**_{ν}  (*s* = 1, . . . , *S*), with *N*_{s} the number of records belonging to regime *s,*

where it is understood that, for the computation of the regime-dependent mean values for a given EM iteration, the missing values in the data vectors **y**_{ν} are filled in with the imputed values of the preceding EM iteration. The loss of degrees of freedom resulting from estimating *S* regime-dependent mean vectors, instead of only one mean vector, must be taken into account in the covariance matrix estimate (6) and in the GCV function (20) by taking *ñ* = *n* − *S,* in place of *ñ* = *n* − 1. Allowing for a regime-dependent mean, which is revised in each EM iteration, amounts to performing the regression (1) with the anomaly data of the deviations from the regime-dependent mean. If the spatial covariance matrix of the anomaly data is regime independent, the spatial covariability of the anomaly data can be exploited with an algorithm that is, up to the distinction of regime-dependent mean values, identical to that for spatial data (section 4a). If the spatiotemporal covariance matrix of the anomaly data is likewise regime independent, the temporal covariability can be exploited with an algorithm that is, up to the distinction of regime-dependent mean values, identical to that for stationary data (section 4b).
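The regime-dependent centering and the associated loss of degrees of freedom can be sketched as follows, with months of the year as the hypothetical regimes. For simplicity the records here are complete; in the regularized EM algorithm, the missing values would first be filled in with the imputed values of the preceding iteration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, S = 120, 4, 12                   # e.g., 10 yr of monthly records, 12 regimes
seasonal = np.sin(2 * np.pi * np.arange(n) / 12)[:, None]
Y = seasonal + rng.standard_normal((n, m))   # records with a periodic mean
regime = np.arange(n) % S                    # month-of-year label of each record

# One mean vector per regime, estimated from the records of that regime
mu = np.array([Y[regime == s].mean(axis=0) for s in range(S)])
anom = Y - mu[regime]                  # deviations from the regime-dependent mean
n_tilde = n - S                        # loss of S degrees of freedom: n_tilde = n - S
Sigma = anom.T @ anom / n_tilde        # regime-independent covariance estimate
```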

If a regime-dependent mean does not suffice to account for the cyclostationary variability of the data, one can also allow for a periodically varying spatiotemporal covariance matrix. It is then necessary to define a separate data matrix **X**_{s} for each regime *s,* such that the regularized EM algorithm calculates separate estimates of the mean and of the spatiotemporal covariance matrix for each regime. For each data vector **y**_{ν} (*ν* = 2, . . . , *N* − 1) in the interior of the dataset, the data matrix **X**_{s} of the regime *s* to which the vector **y**_{ν} belongs must contain, like the data matrix (24) for stationary data, a row vector
(**y**_{ν−1}, **y**_{ν}, **y**_{ν+1})

comprising the data vector itself and the data vectors for the preceding and the succeeding instant. The missing values of the data matrices **X**_{s} and the corresponding estimates of the spatiotemporal covariance matrices must be updated in each EM iteration. As in the regularized EM algorithm for stationary data, the imputed values and the residual covariance matrices should be computed for the data vectors **y**_{1} and **y**_{N} at the ends of the dataset and for the data vectors **y**_{2}, . . . , **y**_{N−1} in the central *m*-column blocks of the data matrices **X**_{s}. The remaining elements of the data matrices and the regime-dependent estimates of the spatiotemporal covariance matrix should then be updated by copying, possibly across different data matrices **X**_{s}, the imputed values and the residual covariance matrices from the data vectors **y**_{2}, . . . , **y**_{N−1} in the central *m*-column blocks of the data matrices. The resulting regularized EM algorithm calculates regime-dependent estimates of the mean and of the spatiotemporal covariance matrix and, using these estimates, fills in the missing values for an instant *ν* with imputed values that are conditioned on the available values for the same instant *ν* and on the available values for the two instants nearest to *ν.*

The values imputed with a regularized EM algorithm that exploits one of the above forms of temporal covariability are potentially more accurate than the values imputed with a regularized EM algorithm that exploits only spatial covariability. For a given set of *m*-dimensional data, the number *p* of variables in the data matrix and the number *S* of regimes distinguished can be viewed as discrete regularization parameters that control the level of detail with which temporal covariability is taken into account. As the size of the data matrix increases and more regimes are distinguished, the estimated temporal covariance information becomes more detailed, but, per covariance matrix element to be estimated, the number of degrees of freedom decreases. Moreover, the computational complexity of the regularized EM algorithm increases with the level of detail of the estimated covariance information. Whether employing a more complex rather than a simpler regularized EM algorithm is worth the added computational effort can be checked by comparing the values of the GCV function, and thus comparing rough estimates of the errors of the imputed values (see the appendix), for various compositions of the data matrix.

## 5. Other imputation techniques for climate data

Conventional techniques for the imputation of missing values in climate data, such as the techniques with which Smith et al. (1996), Kaplan et al. (1997, 1998), and Mann et al. (1998, 1999) impute missing values in surface temperature data, differ from the regularized EM algorithm in two respects.

First, the conventional techniques neglect the interdependence of the imputed values and the estimated statistics of the data. In all above-cited studies, a covariance matrix, be it a spatial or a spatiotemporal covariance matrix, is estimated from the complete or almost complete more recent records of the surface temperature datasets under consideration, whereupon missing values in the records for earlier years are filled in with imputed values that are conditioned on the estimated covariance matrix. If the statistics of the data are estimated from only a small portion of the records in the dataset, filling in missing values with imputed values may underestimate long-term climate variability. The regularized EM algorithm reduces such an underestimation of long-term climate variability by allowing for the dependence of the estimated statistics of the data on all available values in the dataset, including the available values in incomplete records.

Second, the conventional techniques regularize the ill-posed or ill-conditioned problems that arise in the imputation of missing values by one form or another of truncated principal component analysis, in which typically a single truncation parameter, or a single discrete regularization parameter, is chosen for an entire dataset. For example, Smith et al. (1996) estimate imputed temperature values for a given region by analyzing the regression of grid points with missing values on grid points with available values in a truncated basis of principal components of a covariance matrix estimate. Their regularized regression method, although it is not presented in terms of regression analysis, resembles what is known to statisticians as latent root regression and to applied mathematicians as truncated total least squares—a regularization method that arises, like ridge regression, when observational errors in the available data are explicitly taken into account (Webster and Gunst 1974; Fierro et al. 1997; van Huffel and Vandewalle 1991, chapter 3). Smith et al., as well as the other above-cited authors, choose a single regularization parameter per dataset, which means that the regularization does not adapt to the density and the pattern of the available values in the incomplete records. In the regularized EM algorithm, ridge regression regularizes the ill-posed or ill-conditioned estimation of regression coefficients, which, as argued in section 3a, might offer advantages over regularization methods that are based on truncated principal component analyses. Moreover, the continuous regularization parameter of ridge regression is chosen adaptively by generalized cross-validation, such that the regularization adapts to the density and the pattern of the available values in each incomplete record.
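Ridge regression with a GCV-chosen regularization parameter can be sketched generically. The example below uses the common textbook GCV function and an SVD-based ridge solution, not the paper's specific form (20); the function name `ridge_gcv`, the filter-factor convention (adding `h**2` to the squared singular values), and the search grid are all this example's assumptions:

```python
import numpy as np

def ridge_gcv(A, b, h_grid):
    """Ridge regression with the regularization parameter chosen by
    generalized cross-validation (generic textbook GCV function)."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Utb = U.T @ b
    best_G, best_h, best_x = np.inf, None, None
    for h in h_grid:
        f = s**2 / (s**2 + h**2)             # ridge filter factors
        resid = b - U @ (f * Utb)            # residual of the ridge fit
        T = n - f.sum()                      # trace of I minus the "hat" matrix
        G = n * (resid @ resid) / T**2       # GCV function value
        if G < best_G:
            best_G, best_h = G, h
            best_x = Vt.T @ ((f / s) * Utb)  # ridge coefficient estimate
    return best_h, best_x

# More variables (50) than records (20): the unregularized regression is ill posed
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[:3] = (1.0, -2.0, 0.5)
b = A @ x_true + 0.1 * rng.standard_normal(20)
h_opt, x_hat = ridge_gcv(A, b, np.logspace(-3, 2, 60))
```

Because the GCV function is evaluated per regression problem, the selected parameter adapts to each record's pattern of available values, which is the adaptivity the conventional techniques lack.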

Because the above-cited conventional techniques estimate the statistics of the data under consideration only from a subset of the available data and regularize the ill-posed regression problems nonadaptively, they can be viewed as approximations to the regularized EM algorithm. Thus, on theoretical grounds one would expect that the regularized EM algorithm yields imputed values that are at least as accurate as the values imputed with one of the conventional techniques. In the limit of sufficiently large sample sizes in which regularization becomes unnecessary and the regularization parameter *h* can be set to zero, the regularized EM algorithm reduces to the EM algorithm, for which Dempster et al. (1977) proved some general optimality properties. However, that the EM algorithm for typical climate data involves the solution of ill-posed problems is more than a mere inconvenience that complicates the use of an otherwise optimal method: there are no general, problem-independent criteria according to which the optimality of a method for ill-posed problems can be established (Linz 1984). Hence, any claim that the regularized EM algorithm or any other technique for the imputation of missing values in climate data is “optimal” in some general sense would be unjustified. The performance of the regularized EM algorithm must be assessed in practice.

## 6. Test with simulated surface temperature data

The regularized EM algorithm was tested with sets of simulated monthly mean surface temperature data in which values were deleted in a manner characteristic of observational data. Because the missing values in the resulting sets of test data were known, the accuracy with which the regularized EM algorithm imputes missing values could be assessed. To be able to compare the performance of the regularized EM algorithm with that of a conventional noniterative imputation technique, the missing values in the test datasets were also imputed with the technique of Smith et al. (1996).

### a. The imputation technique of Smith et al. (1996)

Smith et al. impute missing values in a given record of an incomplete dataset by first estimating a regression model of the same form as the regression model (1) of the EM algorithm and then filling in missing values with the values that the estimated regression model predicts given the available values in the record. The way in which Smith et al. estimate the regression models entails certain limitations on the applicability of their technique, and these limitations also restrict the extent to which their technique can be compared with the regularized EM algorithm. To understand these limitations, it is necessary to consider the technique of Smith et al. in more detail.^{6}

Smith et al. estimate the coefficients of the regression model for a given record from a spatial covariance matrix that is the sample covariance matrix of a complete set of training data. The training dataset contains complete records of monthly mean surface temperature data for the years 1982–93, with the data given as anomaly data, that is, as the deviations of the monthly mean temperatures for a given month from the estimated climatological mean temperatures for that month. The spatial covariance matrix is estimated as the sample covariance matrix of the 144 monthly records for all months of the 12 years from 1982 through 1993. Smith et al. assume that a periodically varying mean suffices to account for the cyclostationary variability of the data and that the spatial covariance matrix of the anomaly data can be approximated by a stationary covariance matrix (cf. section 4c).

The matrix of regression coefficients **B** of the regression model (1) for a given record is computed from the estimated spatial covariance matrix by a regularization method known as truncated total least squares (Fierro et al. 1997), which leads to the estimate

$$ \hat{\mathbf{B}} = \mathbf{T}^{+T}_{aq}\, \mathbf{T}^{T}_{mq}, $$

where **T**_{aq} ∈ ℜ^{pa×q} and **T**_{mq} ∈ ℜ^{pm×q} are submatrices of the matrix **T** that contains as its columns the eigenvectors of the estimated spatial covariance matrix, and **T**^{+}_{aq} = (**T**^{T}_{aq}**T**_{aq})^{−1}**T**^{T}_{aq} is a pseudoinverse of the matrix **T**_{aq}. The submatrix **T**_{aq} consists of those rows of the first *q* eigenvectors **T**_{:,1:q} that belong to the variables for which, in the given record, the values are available, and the submatrix **T**_{mq} consists of those rows of the first *q* eigenvectors **T**_{:,1:q} that belong to the variables for which, in the given record, the values are missing (cf. the partitioning of the eigenvector matrix **T** in section 3a). The truncation parameter *q* is a discrete regularization parameter, which Smith et al. choose with an ad hoc technique. In truncated total least squares, the regression coefficients are linear combinations of the subvectors **T**_{aq} of the *q* leading eigenvectors of the estimated overall covariance matrix. Fierro et al. (1997) investigate properties of regularization by truncated total least squares, and Golub et al. (2000) show how considerations of data errors can give rise both to truncated total least squares and to ridge regression.
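A minimal sketch of this truncated-eigenvector regression follows, assuming the partitioning just described. The function name `ttls_impute` and the synthetic rank-3 data are illustrative; this is not the Smith et al. implementation:

```python
import numpy as np

def ttls_impute(C, avail, miss, x_a, q):
    """Fill in the missing entries of one record from its available
    entries by regressing in the truncated eigenvector basis of a
    covariance estimate C (sketch of a truncated-TLS-style regression)."""
    _, T = np.linalg.eigh(C)                 # eigenvectors, ascending order
    Tq = T[:, ::-1][:, :q]                   # q leading eigenvectors
    Taq, Tmq = Tq[avail, :], Tq[miss, :]     # rows for available/missing variables
    Taq_pinv = np.linalg.pinv(Taq)           # (Taq^T Taq)^{-1} Taq^T
    B = Taq_pinv.T @ Tmq.T                   # regression coefficient matrix
    return x_a @ B                           # imputed values for the record

rng = np.random.default_rng(2)
F = rng.standard_normal((200, 3))            # 3 latent factors
X = F @ rng.standard_normal((3, 8))          # strongly correlated 8-variable data
C = np.cov(X, rowvar=False)
avail, miss = np.arange(6), np.array([6, 7])
x_hat_m = ttls_impute(C, avail, miss, X[0, avail], q=3)
```

With exactly rank-3 data and *q* = 3, the sketch recovers the two deleted entries essentially exactly; with noisy data, the choice of *q* controls the bias–variance tradeoff of the imputation.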

The spatial covariance matrix and the regression coefficients are estimated from a training dataset that spans only 12 years. To account for interdecadal variability of the monthly mean temperature, Smith et al. estimate the mean vector *μ* of the regression model (1) for a given record as an 11-yr running mean. For the estimation of the mean values of all variables in a given record, it is therefore necessary that sufficient data are available in the 11 years surrounding the year to which the given record belongs. The running mean is smoothed spatially, in this test with a radial Gaussian smoothing filter of standard deviation 500 km, in order to stabilize the estimation of the mean vector and to fill in missing values in the mean vector with nearby available mean values when, for any grid point, there are not sufficient data in an 11-yr window.

Thus, the technique of Smith et al. is based on different estimation and regularization procedures for the same regression model (1) that is used in the regularized EM algorithm. That the technique of Smith et al. requires the estimation of a running mean limits its applicability to datasets that do not contain variables for which values are missing in many consecutive records.

### b. Test data

To obtain incomplete datasets with which the regularized EM algorithm and the imputation technique of Smith et al. could be tested, some values were deleted in nine sets of simulated surface temperature data. The simulated datasets were obtained from nine integrations of a coupled climate model of low (R15) resolution [see Dixon and Lanzante (1999) for a description of the ensemble of integrations]. The climate model was developed at the Geophysical Fluid Dynamics Laboratory and, in the past decade, variants of it have been used in several studies to simulate anthropogenic climate change (e.g., Manabe et al. 1991; Manabe and Stouffer 1994). Monthly means of the simulated surface temperature data were interpolated to the 5° × 5° latitude–longitude grid of the set of merged land surface and sea surface temperature data of Jones (1994) and Parker et al. (1994, 1995). Corresponding to the locations and times for which values are missing in this set of observational temperature data, values were deleted in the nine sets of simulated temperature data.

Requirements on the spatial and temporal continuity of data coverage in a study of interdecadal climate change—the study that motivated the development of the regularized EM algorithm—and the limitations of the technique of Smith et al. led to some restrictions on the test datasets: only the simulated monthly mean temperatures for July were considered, and the datasets were restricted temporally to the 53 simulated years from 1946 through 1998 and spatially to the region between the 25°S and the 60°N latitude circles. Within this period and this region, grid points for which more than 30% of the temperature values had been deleted were excluded from the datasets. With these restrictions on the datasets, the spatially smoothed 11-yr running mean, required by the technique of Smith et al., was available for all variables and for all records. What resulted were nine test datasets, each consisting of *N* = 53 records with simulated values of the mean July surface temperature for *m* = 1156 grid points. In each of the nine datasets, 3.3% of the values were missing.

### c. Test results

With a regularized EM algorithm with individual ridge regressions, the temporal mean, the spatial covariance matrix, and the missing values of the simulated July temperatures were estimated from the incomplete test datasets. Only the spatial covariance matrix was estimated and exploited for the imputation of missing values (cf. section 4a). For each of the nine test datasets, the regularized EM algorithm was initialized with the mean vector estimated as the sample mean of all values available in the test dataset and with a covariance matrix estimated as the sample covariance matrix of the test dataset with estimated mean values substituted for missing values. The regularized EM algorithm was iterated until it reached the stopping criterion

where **X**^{t} is the data matrix with the imputed values of the *t*th iteration filled in for missing values, and the sums extend over all missing values. The stopping criterion was reached after 14–17 iterations.^{7}
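The iterate-until-relative-change-is-small logic can be sketched generically. The toy update and the tolerance of 10⁻³ below are this example's assumptions; the paper's actual update steps and stopping threshold are not reproduced here:

```python
import numpy as np

def iterate_imputation(update, X0, mask, tol=1e-3, max_iter=100):
    """Iterate an imputation update until the relative change of the
    imputed values, with the sums extending over the missing entries
    only, drops below tol. `update` maps a completed data matrix to a
    new completed data matrix."""
    X = X0.copy()
    for t in range(1, max_iter + 1):
        X_new = update(X)
        num = np.sqrt(((X_new - X)[mask] ** 2).sum())
        den = np.sqrt((X_new[mask] ** 2).sum())
        X = X_new
        if den > 0 and num / den <= tol:
            break
    return X, t

# Toy update: move each imputed value halfway toward its column mean
rng = np.random.default_rng(3)
X_true = rng.standard_normal((30, 5))
mask = rng.random((30, 5)) < 0.1             # pattern of "missing" values
mask[0, 0] = True                            # ensure at least one missing value
X0 = np.where(mask, 0.0, X_true)
cols = np.where(mask)[1]

def update(X):
    X_new = X.copy()
    X_new[mask] = 0.5 * X[mask] + 0.5 * X.mean(axis=0)[cols]
    return X_new

X_final, iterations = iterate_imputation(update, X0, mask)
```

The toy update is a contraction, so the relative change decays geometrically and the loop terminates well before `max_iter`, mirroring the monotone (if slow) convergence observed in the test.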

Figure 1 shows the mean and the standard deviation of the root-mean-square (rms) relative imputation errors for the nine datasets after each of the first 15 iterations of the regularized EM algorithm. For a given set of test data, the rms relative imputation error after the *t*th EM iteration is defined as

$$ \delta \mathbf{X}^{t} = \left[ \frac{1}{M} \sum_{(i,j)\ \mathrm{missing}} \frac{\left( X^{t}_{ij} - X^{c}_{ij} \right)^{2}}{\left( \sigma^{c}_{j} \right)^{2}} \right]^{1/2}, \qquad (26) $$

where the matrix **X**^{c} contains the complete set of simulated surface temperature data from which values had been deleted to obtain the incomplete set of test data. The normalization constant *M* is the total number of missing values in a dataset, and *σ*^{c}_{j} is the standard deviation of the *j*th variable of the complete dataset **X**^{c}. The squared relative imputation error [the fraction inside the sum (26)] averaged over the nine datasets was spatially and temporally fairly homogeneous. Figure 1 shows that the rms relative imputation error averaged over the nine datasets decreased monotonically from iteration to iteration. Indeed, in the 14–17 iterations that the regularized EM algorithm needed to reach the stopping criterion, the rms relative imputation error was found to decrease monotonically for each individual dataset. With the test datasets, then, the regularized EM algorithm converged reliably, albeit slowly.
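The error measure (26) translates directly into code. The helper name `rms_relative_error` and the synthetic data below are illustrative:

```python
import numpy as np

def rms_relative_error(X_imp, X_c, mask):
    """Rms relative imputation error: squared imputation errors scaled
    by each variable's variance in the complete data, averaged over the
    M missing values, then square-rooted."""
    sigma = X_c.std(axis=0, ddof=1)          # std deviation of each variable
    cols = np.where(mask)[1]                 # variable index of each missing value
    err2 = (X_imp[mask] - X_c[mask]) ** 2 / sigma[cols] ** 2
    return np.sqrt(err2.mean())

rng = np.random.default_rng(4)
X_c = rng.standard_normal((50, 6))           # "complete" dataset
mask = rng.random((50, 6)) < 0.2             # pattern of deleted values
mask[0, 0] = True                            # ensure at least one missing value
X_perfect = X_c.copy()                       # a perfect imputation has zero error
```

Scaling by *σ*^{c}_{j} makes errors at variables with very different variances comparable, so the average over missing values is not dominated by the most variable grid points.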

Smith et al. estimate the spatial covariance matrix needed by their imputation technique from a training dataset consisting of in situ temperature measurements that were completed with satellite data. For the comparison of their imputation technique with the regularized EM algorithm, the spatial covariance matrices needed by the technique of Smith et al. were estimated from complete datasets for all months of the simulated years 1982–93, after the climatological monthly mean of the simulated years 1961–90 had been subtracted from each month of the simulated data. To account for the expected decrease of the temperature variance with the area of the grid box that a variable of a dataset represents, the simulated data for each grid point were scaled by the square root of the cosine of the latitude of the grid point. (No such scaling is necessary in the regularized EM algorithm because the regularized EM algorithm is based on eigendecompositions of correlation matrices, not of covariance matrices.) The imputation technique of Smith et al. was tried with various truncation parameters *q* for the total least squares regularization. The results reported below belong to the truncation parameter *q* = 20 that, averaged over the nine datasets, yielded the smallest rms relative imputation error. Thus, in choosing the truncation parameter with the smallest rms relative imputation error, information from the complete datasets was used that would not ordinarily be available. The spatial covariance matrices, moreover, were estimated from training datasets that contained some of the values that were missing in the test datasets, whereas, in practice, the training dataset and the incomplete dataset in which missing values are to be imputed are distinct. Therefore, the conditions for the technique of Smith et al. were, in this test, more favorable than they would be in practice. 
The regularized EM algorithm, on the other hand, used only the kind of data that would have been available in practice, namely, only the incomplete test datasets with simulated mean temperatures for July.
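The cosine-latitude scaling of the data ahead of a covariance-based analysis can be sketched as follows; the latitudes and data are illustrative:

```python
import numpy as np

# Grid boxes shrink toward the poles, so each grid-point series is scaled
# by the square root of the cosine of its latitude; covariances then scale
# with cos(lat), mimicking the expected decrease of variance with box area.
rng = np.random.default_rng(5)
lat = np.array([-25.0, 0.0, 30.0, 60.0])     # grid-point latitudes (degrees)
X = rng.standard_normal((10, 4))             # 10 records, 4 grid points
w = np.sqrt(np.cos(np.radians(lat)))
X_scaled = X * w                             # scale each grid-point column
```

As noted above, no such scaling is needed when the analysis is based on eigendecompositions of correlation matrices, since correlations are invariant under rescaling of the individual variables.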

Table 1 displays the relative errors that occurred with the regularized EM algorithm and with the technique of Smith et al. For each of the nine datasets, the rms relative imputation error *δ***X** of the regularized EM algorithm was smaller than the rms relative imputation error of the technique of Smith et al. That is, although the regularized EM algorithm used only the incomplete datasets of simulated July temperatures, it led to more accurate imputed values than the technique of Smith et al., which additionally required complete sets of training data with simulated temperatures for every month of a 12-yr period and for which the truncation parameter was chosen from the complete datasets such as to minimize the rms relative imputation error averaged over the nine datasets.

The standard errors of the values imputed with the regularized EM algorithm were estimated as described in the appendix. From the estimated standard errors, the estimated rms relative imputation error *δ***X̂** was calculated in analogy to the actual rms relative imputation error (26), with the estimated squared errors replacing the actual squared errors in the numerator of the relative imputation error. The comparison of the rms relative imputation error *δ***X** and of the estimated rms relative imputation error *δ***X̂** shows that the estimated imputation error tends to underestimate the actual imputation error. The estimated rms relative imputation error was, on the average, 11% smaller than the actual rms relative imputation error.

The underestimation of the imputation error points to a general difficulty in estimating errors in ill-posed problems. Error estimates in ill-posed problems depend on the regularization method employed and on the regularization parameter, but one rarely has a priori reasons, independent of the particular dataset under consideration, for the choice of a regularization method and a regularization parameter. In addition to the uncertainty about the adequacy of the regression model (1), the uncertainties about the adequacy of the regularization method and of the regularization parameter contribute to the imputation error. Since in the estimated imputation error, these uncertainties are neglected, the estimated imputation error underestimates the actual imputation error.

Since only 3.3% of the values in the test datasets were missing, estimating the mean of the records from the incomplete test datasets was unproblematic. The sample means of the complete datasets and the sample means of the completed test datasets with imputed values filled in for missing values did not differ significantly. In the test datasets, 500 of the *p* = 1156 variables had at least one missing value. The individual sample means of these 500 variables, estimated from the test datasets completed with the regularized EM algorithm and with the technique of Smith et al., fell within the 95% confidence intervals of the sample mean of the complete datasets. Thus, the mean values estimated with the regularized EM algorithm and with the technique of Smith et al. were statistically indistinguishable from the sample mean of the complete datasets.

Because of the small fraction of missing values in the test datasets, the differences between the variances estimated with the regularized EM algorithm from the incomplete test datasets and the sample variances of the complete datasets were also of a similar magnitude as the expected sampling errors of the sample variances. However, the mean of the differences between the variances estimated from the incomplete test datasets and the sample variances of the complete datasets was not zero. As a measure of the bias of the estimated variances, Table 1 displays the error

$$ \delta(\operatorname{tr} \boldsymbol{\Sigma}) = \frac{\operatorname{tr} \hat{\boldsymbol{\Sigma}} - \operatorname{tr} \boldsymbol{\Sigma}^{c}}{\operatorname{tr} \boldsymbol{\Sigma}^{c}} $$

of the trace of the spatial covariance matrix **Σ̂** estimated with the regularized EM algorithm relative to the trace of the sample covariance matrix **Σ**^{c} of the complete set of simulated data. The spatial covariance matrix **Σ̂** is the spatial covariance matrix of simulated July temperatures that the regularized EM algorithm estimates and exploits for the imputation of missing values. That the trace of the estimated covariance matrix was, for all nine sets of test data, smaller than the trace of the sample covariance matrix of the complete datasets indicates that the regularized EM algorithm underestimates the variances. This underestimation of the variances is a consequence of using the residual covariance matrix of the regularized regression model in place of the unknown conditional covariance matrix of the imputation error (cf. section 3a). The residual covariance matrix of the regularized regression model underestimates the conditional covariance matrix of the imputation error for the same reason that the estimate of the imputation error in the appendix underestimates the actual imputation error: the error estimates neglect the uncertainties about the regularization method and the regularization parameter. To be sure, the traces of the estimated covariance matrices, on the average, have a relative error of only about 1.8%, but for datasets in which a greater fraction of the values is missing, the underestimation of the variances will be greater.^{8}
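The relative trace error is a two-line computation. The deflated covariance estimate below is synthetic, chosen only to mimic the reported underestimation:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((40, 6))
Sigma_c = np.cov(A, rowvar=False)            # "complete data" sample covariance
Sigma_hat = 0.98 * Sigma_c                   # a slightly deflated estimate
delta_tr = (np.trace(Sigma_hat) - np.trace(Sigma_c)) / np.trace(Sigma_c)
# delta_tr < 0 indicates underestimated total variance
```

Because the trace is the sum of the variances, a negative relative trace error is a one-number summary of the variance bias of the covariance matrix estimate.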

The test of the regularized EM algorithm demonstrates that the algorithm is applicable to typical sets of incomplete climate data and that it leads to more accurate estimates of the missing values than the technique of Smith et al., even though for the technique of Smith et al., the conditions of the test were more favorable than they would be in practice. The limitations of the technique of Smith et al. make it difficult to compare that technique with the regularized EM algorithm in a more complex test problem with a greater fraction of missing values. However, that the regularized EM algorithm already in this relatively simple test led to more accurate imputed values than the technique of Smith et al. suggests that in more complex tests the regularized EM algorithm would also perform better than conventional noniterative imputation techniques that resemble the one of Smith et al.

## 7. Summary and discussion

Two characteristics complicate the multivariate analysis of typical sets of climate data: most sets of climate data are incomplete, and they contain more variables than records from which the statistics of the data can be estimated. If an incomplete dataset has more records than variables so that it is of full rank, the EM algorithm can be used both to compute the maximum likelihood estimates of the statistics of the data and to fill in missing values with their conditional expectation values given the available values and the estimated statistics. But if an incomplete dataset, like most sets of climate data, has more variables than records so that it is rank-deficient, the conditional expectation values of the missing values are underdetermined and the EM algorithm cannot be used. The EM algorithm for Gaussian data, which is based on iterated linear regression analyses, was taken as the point of departure for the development of a regularized EM algorithm, in which ridge regression with generalized cross-validation replaces the ill-posed conditional maximum likelihood estimation of the regression parameters in the conventional EM algorithm. With the regularized EM algorithm, the mean and the covariance matrix can be estimated from incomplete datasets with more variables than records and missing values in such datasets can be filled in with imputed values.

Since replacing the conditional maximum likelihood estimation of the regression parameters with the estimation of regularized regression parameters by ridge regression is a mere heuristic, the regularized EM algorithm is no longer justified on grounds of general principles such as the maximum likelihood principle. The proofs that the conventional EM algorithm leads to consistent and unbiased estimators and converges monotonically (Little and Rubin 1987, chapter 7) are not transferable to the regularized EM algorithm. Nevertheless, in a test with sets of simulated surface temperature data, the regularized EM algorithm did converge reliably. The values imputed with the regularized EM algorithm were more accurate than the values imputed with a conventional noniterative imputation technique that exploits statistics estimated from a complete training dataset to fill in missing values in an incomplete dataset. But the test with the simulated data also showed that the variances estimated with the regularized EM algorithm are too small, a consequence of an underestimation of the imputation error that the ridge regression entails. Covariance matrices estimated with the regularized EM algorithm and statistics derived from them must therefore be interpreted cautiously, particularly when the fraction of missing values in an incomplete dataset is large.

The regularized EM algorithm differs from conventional imputation techniques for climate data in two respects: it treats the estimation of statistics from incomplete datasets and the imputation of missing values as the nonlinear problem that it is, and it regularizes the ill-posed problems arising in the imputation of missing values adaptively. In conventional imputation techniques for climate data, the nonlinear problem of estimating statistics from incomplete data and imputing missing values is linearized by estimating the statistics of the data from a training dataset, a complete subset of the available data, and exploiting the estimated statistics for the imputation of missing values in the subset of the data in which there are values missing. In the regularized EM algorithm, the statistics are estimated from all available data, including the available values in records with missing values, which makes it unnecessary to distinguish between training datasets and incomplete datasets in which values are to be imputed. Moreover, in conventional imputation techniques, usually a single regularization parameter is chosen for the regularization of the ill-posed problems arising in the imputation of missing values in different records of an incomplete dataset. In the regularized EM algorithm, the ill-posed problems are regularized adaptively, with regularization parameters chosen by generalized cross-validation such as to adapt to the density and pattern of available values in the different records of an incomplete dataset.

The regularized EM algorithm can not only be applied to incomplete datasets containing values of a single geophysical field, such as values of the surface temperature at various grid points or at various stations, but it can also be used to construct historic surface temperature datasets from proxy data. The regularized EM algorithm estimates the correlations between the variables of a dataset and exploits the estimated correlations to fill in missing values with imputed values. Some of the variables in a dataset might represent surface temperature values at the nodes of a spatial grid and other variables might represent proxies of the surface temperature (cf. Mann et al. 1998). The regularized EM algorithm then would estimate the correlations between the temperature proxies and the temperature values at the various grid points and exploit these correlations to impute missing temperature values from available proxy data. Time series of spatial averages of the surface temperature field, regional or global averages, for example, could be computed as one-dimensional projections of the completed set of spatial temperature data. If the relative observational errors of the available temperature measurements and of the proxy variables are of different magnitudes, modifying the regularization method of the regularized EM algorithm so as to take the variance structure of the observational errors explicitly into account might lead to more accurate imputed temperature values than the regularization method presented above (cf. section 3a and Golub et al. 2000).

An extensive literature exists both on the EM algorithm and on ridge regression as a regularized regression method and on generalized cross-validation as a method of determining a regularization parameter. Ridge regression and generalized cross-validation as they are commonly presented had only to be adapted to make them fit into the framework of the EM algorithm. Given the ubiquity of ill-posed problems in the atmospheric and oceanic sciences, be it in the initialization of weather forecasts, in the detection of climate change, or in any of the numerous other problems that involve the solution of ill-conditioned or singular linear systems, it seems that regularization methods other than the widely used truncated principal component analysis and criteria such as generalized cross-validation for determining a regularization parameter deserve more attention than they currently receive. To facilitate experiments with the regularized EM algorithm and with the regularization methods it is composed of, the program code with which the test problems of this paper were computed has been made available online (Schneider 1999).

## Acknowledgments

This research project was supported by a NASA Earth System Science Fellowship. I thank Keith Dixon for providing the ensemble of climate simulations; Isaac Held, John Lanzante, Michael Mann, and Arnold Neumaier for comments on drafts of this paper; and Heidi Swanson for editing the manuscript.

## REFERENCES

Manabe, S., R. J. Stouffer, M. J. Spelman, and K. Bryan, 1991: Transient responses of a coupled ocean–atmosphere model to gradual changes of atmospheric CO_{2}. Part I: Annual mean response. *J. Climate,* **4,** 785–818.

### APPENDIX

#### Estimation of the Imputation Error

The derivation of the GCV function by Golub et al. (1979) suggests a heuristic for estimating the standard error of the imputed values. Golub et al. showed that the GCV function (20) is a cross-validatory estimate of the mean-squared error of predictions with an estimated ridge regression model, provided that the cross-validation is carried out after transforming the data to a basis in which the individual records of the dataset are strongly coupled (see also Wahba 1990, chapter 4). That the GCV function 𝒢(*h*) is an estimate of the predictive mean-squared error suggests taking, in the notation of section 3,

$$ (\Delta \hat{\mathbf{x}}_{m})^{2}_{j} \approx \left( \frac{\tilde{n}}{\mathcal{T}(h_{j})} \right)^{2} (\hat{\mathbf{C}}_{h})_{jj} \qquad \mathrm{(A1)} $$

as a rough estimate of the squared standard error (Δ**x̂**_{m})^{2}_{j} of the imputed value (**x̂**_{m})_{j}.
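In code, the heuristic is a one-line inflation of the residual variance. The function and argument names are illustrative, with `T_h` standing for 𝒯(*h*_{j}) and `n_tilde` for *ñ*:

```python
def imputation_error_estimate(C_h_jj, n_tilde, T_h):
    """Heuristic squared standard error of an imputed value: the
    residual variance (C_h)_jj inflated by two factors n_tilde / T(h_j),
    one correcting the residual degrees of freedom, one accounting for
    the sampling error of the ridge regression coefficients."""
    return (n_tilde / T_h) ** 2 * C_h_jj

# n_tilde = 52 nominal degrees of freedom, effective T(h) = 40 after regularization
err2 = imputation_error_estimate(C_h_jj=0.3, n_tilde=52, T_h=40)
```

Since 𝒯(*h*_{j}) < *ñ* whenever the regression is regularized, the estimate always exceeds the raw residual variance, yet, as argued above, it remains only a lower bound on the actual imputation error.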

However, using the GCV function value as an estimate of the squared imputation error is a heuristic without theoretical foundation. The theoretical foundation of generalized cross-validation is furnished by the fact that the regularization parameter that minimizes the GCV function is approximately equal to the regularization parameter that minimizes the expected mean-squared imputation error (cf. Wahba 1977; Golub et al. 1979). That does not, however, imply that the GCV function value itself is approximately equal to the expected mean-squared imputation error. The uncertainties about the adequacy of the regression model (1), of the regularization method, and of the regularization parameter all contribute to the imputation error, but the error estimate (A1) does not account for these uncertainties. Therefore, the estimated imputation error is merely a lower bound on the actual imputation error.

Nevertheless, the error estimate (A1) has a structure that makes it plausible as a heuristic lower bound on the imputation error. One of the factors *ñ*/𝒯(*h*_{j}) that multiplies the residual variance (**Ĉ**_{h})_{jj} can be interpreted as a bias-correction factor that corrects the assumed number of degrees of freedom in the estimate of the residual variance from *ñ* to 𝒯(*h*_{j}), thus taking into account the loss of degrees of freedom resulting from the estimation of the ridge regression coefficients. The other factor *ñ*/𝒯(*h*_{j}) that multiplies the residual variance (**Ĉ**_{h})_{jj} can be interpreted as a variance inflation factor that accounts for the sampling error of the ridge regression coefficients and thus of the imputed values.
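In concrete terms, the heuristic (A1) amounts to scaling the residual variance by the squared ratio of the effective sample size to the effective residual degrees of freedom. A minimal sketch (the function and argument names are illustrative, not from the published program code):

```python
def imputation_error_estimate(n_eff, dof_eff, resid_var):
    """Heuristic squared standard error of an imputed value, eq. (A1):

        (Delta x_m)^2_j  ~  (n_eff / dof_eff)^2 * resid_var

    n_eff     -- effective number of records n-tilde
    dof_eff   -- effective residual degrees of freedom T(h_j) of the ridge fit
    resid_var -- estimated residual variance (C_h)_jj
    """
    # One factor n_eff/dof_eff corrects the degrees of freedom assumed in
    # the residual variance estimate; the second inflates the variance to
    # account for the sampling error of the ridge regression coefficients.
    return (n_eff / dof_eff) ** 2 * resid_var

# Illustrative numbers: 100 effective records, 60 effective degrees of
# freedom, residual variance 0.9 gives (100/60)^2 * 0.9 = 2.5.
err2 = imputation_error_estimate(100.0, 60.0, 0.9)
```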

## Footnotes

*Corresponding author address:* Tapio Schneider, Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012.

Email: tapio@cims.nyu.edu

^{1}

**A**_{i:} denotes the *i*th row and **A**_{:j} the *j*th column of a matrix **A**. The index *i* has been omitted from the symbols whose affiliation to a specific record **X**_{i:} can be inferred from the context.

^{2}

In principle, it suffices to estimate one regression model for each pattern of missing values in a dataset, instead of estimating one regression model for each record. However, in sets of climate data with many variables, it is rare that two or more records have the same pattern of missing values, and so the computational effort of finding matching patterns of missing values will often exceed the computational savings that result from having to estimate fewer regression models.
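The grouping of records by their pattern of missing values could be sketched as follows (a hypothetical helper, assuming the data are held in a NumPy array with missing values encoded as NaNs):

```python
import numpy as np

def group_by_missing_pattern(X):
    """Group record (row) indices by their pattern of missing values.

    Returns a dict mapping each distinct pattern, as a tuple of booleans
    that are True where a value is missing, to the list of row indices
    sharing that pattern. One regression model per key would suffice.
    """
    groups = {}
    for i, row in enumerate(np.isnan(X)):
        groups.setdefault(tuple(row), []).append(i)
    return groups

X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0]])
groups = group_by_missing_pattern(X)
# Rows 0 and 1 share a pattern, so one regression model would serve both;
# in large climate datasets such coincidences are rare.
```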

^{3}

The submatrices of the covariance matrix estimate, as well as the estimates of other quantities appearing in what follows, depend on the EM iteration *t* and on the record **X**_{i:} under consideration. Nevertheless, the indexes *t* and *i* have been omitted from the symbols whose affiliation to a specific iteration and to a specific record can be inferred from the context.

^{4}

The GCV function is usually defined for a single dependent variable (i.e., *p*_{m} = 1), so that the data matrix **X**_{m} and the regression coefficients **B̂**^{′}_{h} are vectors, for which the Frobenius norm in the numerator of the GCV function (20) becomes the Euclidean norm. Replacing the Euclidean norm for vectors with the Frobenius norm for matrices leads to the extension of standard results presented here.
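The extension to matrix-valued dependent variables amounts to replacing the Euclidean norm by the Frobenius norm in the numerator of the GCV function. A sketch, assuming the standard form 𝒢(*h*) = *ñ* ‖**X**_{m} − **X**_{a}**B̂**^{′}_{h}‖²_{F} / 𝒯(*h*)², with 𝒯(*h*) equal to *ñ* minus the sum of the ridge filter factors, and a full-rank predictor matrix (details of the paper's eq. (20) may differ):

```python
import numpy as np

def gcv_frobenius(Xa, Xm, h):
    """GCV function for ridge regression of Xm (n x p_m) on Xa (n x p_a),
    with the Euclidean norm replaced by the Frobenius norm.

    Assumes Xa has full column rank, and that T(h) is the effective
    residual degrees of freedom n minus the sum of the filter factors.
    """
    n = Xa.shape[0]
    U, s, Vt = np.linalg.svd(Xa, full_matrices=False)
    f = s**2 / (s**2 + h**2)                # ridge filter factors
    Bh = Vt.T @ np.diag(f / s) @ U.T @ Xm   # ridge coefficient estimate
    T = n - f.sum()                         # effective residual dof
    resid = Xm - Xa @ Bh
    return n * np.linalg.norm(resid, 'fro')**2 / T**2
```

For a single dependent variable (*p*_{m} = 1) the Frobenius norm reduces to the Euclidean norm, recovering the usual GCV function.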

^{5}

That the spatiotemporal covariance matrix (25) has a block Toeplitz structure means that along its diagonals are identical copies of the spatial covariance matrix **Σ**_{0} or of the lag-1 covariance matrices **Σ**_{−1} and **Σ**_{1}. However, from the structure of the data matrix (24) one can infer that the estimate **Σ̂** of the covariance matrix (25) will not have a block Toeplitz structure. The estimates of the submatrices **Σ**_{0}, **Σ**_{−1}, and **Σ**_{1} along the diagonals of **Σ̂** will differ in that they include or exclude contributions of data vectors for the first instants (*ν* = 1 or 2) and for the final instants (*ν* = *N* or *N* − 1) of the dataset.
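The effect described in this footnote can be illustrated by forming a lag-1 augmented data matrix and estimating its covariance matrix directly. The sketch below is a simplified stand-in for the paper's data matrix (24), whose exact construction may differ; the two diagonal blocks of the sample covariance matrix both estimate **Σ**_{0} but differ because one excludes the last record and the other the first:

```python
import numpy as np

def lag1_augmented_cov(X):
    """Sample covariance of a lag-1 augmented data matrix.

    X is (N x p). The augmented matrix has rows (x_nu, x_{nu+1}) for
    nu = 1..N-1, so its (2p x 2p) sample covariance is a 2x2 block
    matrix: both diagonal blocks estimate Sigma_0, and the off-diagonal
    block estimates the lag-1 covariance Sigma_1.
    """
    p = X.shape[1]
    Z = np.hstack([X[:-1], X[1:]])          # (N-1) x 2p augmented matrix
    S = np.cov(Z, rowvar=False)
    return S[:p, :p], S[p:, p:], S[:p, p:]  # two Sigma_0 estimates, Sigma_1

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
S00, S11, S01 = lag1_augmented_cov(X)
# S00 and S11 differ slightly (boundary records), so the estimate is
# not exactly block Toeplitz, as noted above.
```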

^{6}

Smith et al. do not describe their technique in terms of regression analysis, but the equivalence of their description and the one given here can be seen by expressing the formulas in the paper by Smith et al. in terms of matrix operations. The imputation technique of Mann et al. (1998) can similarly be interpreted as a regression analysis.

^{7}

Details of the implementation of the regularized EM algorithm can be taken from the program code, which is available online (Schneider 1999).

^{8}

The underestimation of the variances can be reduced by multiplying the residual covariance matrices **Ĉ**_{h} with a scalar inflation factor *α* > 1 before adding them to the cross products (9) of the imputed values. The inflation factor *α* might be determined from simulation results like the ones above.
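The inflation step amounts to a single scalar multiplication of the residual covariance matrix before it enters the cross products (9). A trivial sketch (the function name and the default value of *α* are illustrative only; the paper leaves the choice of *α* open):

```python
import numpy as np

def inflate_residual_cov(C_h, alpha=1.1):
    """Scale the residual covariance matrix C_h by an inflation factor
    alpha > 1 before adding it to the cross products of the imputed
    values, to counter the underestimation of variances.
    """
    if alpha <= 1.0:
        raise ValueError("inflation factor alpha must exceed 1")
    return alpha * C_h

C_inflated = inflate_residual_cov(np.eye(2), alpha=1.5)
```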