## 1. Introduction

A major question in prediction theory is whether forecasts from different models can be combined to produce a single forecast with improved skill. One approach is to construct a weighted forecast with weights determined at each grid box by the method of least squares. Krishnamurti et al. (1999, 2000, 2001) investigated this “superensemble” method using centered forecasts as predictors (“centered” means that the sample average has been subtracted). Methods for constructing *probabilistic* forecasts also have been explored. Rajagopalan et al. (2002) proposed a Bayesian scheme in which the weights are determined by maximizing the log likelihood of the multimodel combination. A variant of this method plays a major role in the forecasts currently issued by the International Research Institute (Barnston et al. 2003). Raftery et al. (2005) proposed an ensemble postprocessing scheme based on Bayesian Model Averaging. Doblas-Reyes et al. (2005) explored a variety of methods based on variance inflation, field adjustment, and multiple regression. Other methods have been reviewed by Clemen (1989).

The mere existence of distinct methods for the same problem shows that the technology for combining forecasts is imperfect. These imperfections become clearly evident when the number of model forecasts is not a small fraction of the sample size. Thus, for instance, Robertson et al. (2004) report that as more models are added to the multimodel combination in the Rajagopalan et al. (2002) scheme, the resulting weight maps become more noisy and the weights often become exactly zero for all but one model, with the one model being chosen differently between neighboring grid boxes. Kharin and Zwiers (2002) show that the superensemble method does not perform as well as the simple multimodel mean as the number of models increases. Doblas-Reyes et al. (2005) found that the superensemble method for seven models improved over the simple multimodel mean with 40 yr of data, but not with 20 yr.

The problems that arise from combining too many models are familiar consequences of overfitting, that is, of fitting variability due to sampling errors. To overcome these problems, Robertson et al. (2004) invoke a number of procedures, including averaging across data subsamples and spatial smoothing of the likelihood function. Kharin and Zwiers (2002) imposed constraints on the weighting coefficients. Yun et al. (2003) improved superensemble forecasts by solving the regression equations with singular value decomposition (SVD) and truncating all but the first few singular values. Van den Dool and Rukhovets (1994) and Peng et al. (2002) pool samples from different grid boxes by assuming that the weights are independent of space.

The purpose of this paper is to clarify the fact that a wide variety of methods for reducing overfitting in linear regression problems, including many of those mentioned above, can be interpreted in a single Bayesian framework. Bayesian theory allows one to incorporate “prior knowledge” in the estimation process. In this theory, different estimates are distinguished by different prior beliefs. The importance of Bayesian theory for dealing with ill-posed problems has long been recognized by statisticians. The present paper attempts to present this framework in a form suitable for climate scientists, shows its connection to previous methods employed in multimodel regressions, and applies it to several new, but reasonable, priors.

The next section reviews the linear regression model considered in this paper. The proposed framework is explained and illustrated with several special cases in section 3. The results of applying these estimates to the DEMETER hindcasts (Palmer et al. 2004) are discussed in sections 4–6. It is shown that the simple multimodel mean, which can be constructed without any of the insights of this paper, can beat the skill of all of the regression models investigated here. The significance of this result and a summary of the paper are given in the concluding section.

## 2. The regression model and ordinary least squares

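Consider the linear regression model

$$
\mathbf{y} = \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_K \mathbf{x}_K + \mathbf{w}, \tag{1}
$$

where ***y*** is an observable *N*-dimensional vector, called the *predictand*; ***x***_{1}, ***x***_{2}, . . . , ***x***_{K} are *K* observable *N*-dimensional vectors, called the *predictors*; ***w*** is an unobservable *N*-dimensional vector representing random error; and *β*_{1}, *β*_{2}, . . . , *β*_{K} are unknown weighting coefficients, called *regression parameters*. The model (1) can be written in concise matrix form as

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{w}, \tag{2}
$$

with

$$
\mathbf{X} = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_K], \qquad \boldsymbol{\beta} = [\beta_1 \; \beta_2 \; \cdots \; \beta_K]^T, \tag{3}
$$

where superscript T denotes the transpose operation. The matrix 𝗫 is called the *design matrix*. The least squares estimate of ***β***, denoted ***β***_{LS}, minimizes the sum square error

$$
\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{4}
$$

Computing ∂(SSE)/∂***β*** = **0** to find the stationary value gives the standard solution

$$
\boldsymbol{\beta}_{LS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \tag{5}
$$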
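The least squares solution (5) can be illustrated numerically. The following is a minimal sketch on synthetic data; the sizes, the random design matrix, and the noise level are assumptions chosen for illustration and are not taken from the DEMETER hindcasts.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 22, 6                      # illustrative sample size and number of predictors
X = rng.standard_normal((N, K))   # synthetic design matrix (assumed)
beta_true = np.full(K, 1.0 / K)   # hypothetical "true" weights
y = X @ beta_true + 0.5 * rng.standard_normal(N)

# Solve the normal equations (X^T X) beta = X^T y, which is equation (5)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of the sum square error (4), -2 X^T (y - X beta),
# vanishes at the least squares estimate
gradient = -2.0 * X.T @ (y - X @ beta_ls)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; for an ill-conditioned 𝗫^{T}𝗫 a dedicated least squares routine would be the safer choice.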

In this paper, we seek a linear regression model of the form (1) in which the predictors are a set of forecasts. In this context, ***y*** is the variable to be predicted; ***x***_{1}, ***x***_{2}, . . . , ***x***_{K} are *K* forecasts; and *N* is the number of samples in a historical record. Technically, a constant term (i.e., the intercept term) should be added to the set of regression parameters. However, as discussed in the next section, we use standardized predictors, which renders the constant term negligible.

If an ensemble of forecasts from the same model is available, and these forecasts are exchangeable, then without loss of generality the forecasts ***x***_{1}, ***x***_{2}, . . . , ***x***_{K} can be replaced by the *ensemble mean forecast*. By definition, if the forecasts are exchangeable, then (1) must be invariant with respect to a permutation of forecasts from the same model. This invariance holds if and only if the regression parameters for the same model are identical, in which case the regression model can depend only on the sum of individual forecasts, which is proportional to the ensemble mean forecast. Thus, hereafter, we assume that each of the predictors ***x***_{1}, ***x***_{2}, . . . , ***x***_{K} is the ensemble mean of forecasts from a single model, and *K* denotes the number of distinct models.

## 3. Bayesian regression and constrained least squares

The fundamental distinction between ordinary least squares (LS) and Bayesian regression is that the latter associates a probability distribution with the regression parameters ***β***. This distribution, called a *prior distribution* *p*(***β***), quantifies the uncertainty in the parameters before data becomes available. The prior distribution is not a probability distribution in the sense that it can be estimated by repeated trials of an experiment, but rather in the sense that it quantifies our “degree of belief” in the value of ***β*** prior to data analysis (Jaynes 2003). Although prior assumptions are not always stated explicitly, they often exist nonetheless. For instance, one manifestation of overfitting is that the regression parameters *β*_{1}, *β*_{2}, . . . , *β*_{K} vary by orders of magnitude, yet give excellent in-sample fits of the data. Experienced forecasters dismiss such models because they often produce poor forecasts of independent data and because the large values are deemed “unphysical.” Both objections can be characterized as prior beliefs.

Given a prior distribution *p*(***β***), the distribution for the regression parameters after the data becomes available is computed from Bayes theorem as

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta})}{\int p(\mathbf{y} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta})\, d\boldsymbol{\beta}}, \tag{6}
$$

where *p*(***β***|***y***) is called the *posterior distribution*, interpreted as the conditional distribution of the regression parameters given the specific data ***y***. The distribution *p*(***y***|***β***) is the distribution of the noise term ***w*** in (1). The integral in the denominator is interpreted as a multivariate integral over the range of ***β***. Technically, some of the above distributions also should be conditioned on 𝗫, but this dependence will be dropped since it complicates the notation without adding insight. When more data become available, the Bayes theorem can be applied again with the old posterior becoming the new prior. It can be shown that the final posterior, after all available data have been used, is independent of the order in which the data are entered (Box and Tiao 1973, p. 11). Thus, without loss of generality, we let ***y*** and 𝗫 represent all of the data at once.

Suppose that the random error term ***w*** in (1) is normally distributed with zero mean and covariance matrix *σ*^{2}𝗜, and the regression parameter ***β*** has a normal prior distribution with mean ***μ*** and covariance matrix *γ*^{2}**Σ**. (Writing the covariance matrix as *γ*^{2}**Σ** allows us flexibility to consider structure and magnitude separately; we will often set **Σ** = 𝗜 and consider only variations in *γ*.) It is a standard result in Bayesian theory (Lindley and Smith 1972) that, since all the above distributions are normal, the posterior distribution also is normal with respective mean and covariance matrix

$$
E[\boldsymbol{\beta} \mid \mathbf{y}] = \boldsymbol{\Phi} \left( \sigma^{-2} \mathbf{X}^T \mathbf{y} + \gamma^{-2} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right), \tag{7}
$$

$$
\mathrm{cov}[\boldsymbol{\beta} \mid \mathbf{y}] = \boldsymbol{\Phi} = \left( \sigma^{-2} \mathbf{X}^T \mathbf{X} + \gamma^{-2} \boldsymbol{\Sigma}^{-1} \right)^{-1}. \tag{8}
$$

It is straightforward to verify that the posterior mean *E*[***β***|***y***] converges to the least squares solution (5) in the limit *γ* → ∞, demonstrating that ordinary least squares is the equivalent Bayesian estimation in the limit that the prior distribution has infinite uncertainty. Such a prior distribution is called a *vague prior*, or an *uninformative prior*. It is contradictory to assume infinite uncertainty if other knowledge about the observations is available. The above result elucidates the fact that ordinary least squares is precisely equivalent to assuming infinite uncertainty in our prior knowledge, whether the user recognizes it or not.
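The two limits of the posterior mean can be checked numerically. The sketch below assumes synthetic data and a hypothetical helper named `posterior`; it evaluates the normal-prior posterior mean and covariance and compares a very vague prior with least squares and a very tight prior with the prior mean.

```python
import numpy as np

def posterior(X, y, mu, Sigma, sigma2, gamma2):
    """Posterior mean and covariance for noise N(0, sigma2*I) and prior N(mu, gamma2*Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    Phi = np.linalg.inv(X.T @ X / sigma2 + Sigma_inv / gamma2)   # posterior covariance
    beta_hat = Phi @ (X.T @ y / sigma2 + Sigma_inv @ mu / gamma2)  # posterior mean
    return beta_hat, Phi

rng = np.random.default_rng(1)
N, K = 22, 6                                  # illustrative sizes (assumed)
X = rng.standard_normal((N, K))
y = X @ np.full(K, 1.0 / K) + 0.5 * rng.standard_normal(N)
mu, Sigma = np.full(K, 1.0 / K), np.eye(K)    # hypothetical prior mean and structure

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Vague prior (gamma -> infinity): the posterior mean approaches least squares
beta_vague, _ = posterior(X, y, mu, Sigma, sigma2=0.25, gamma2=1e12)

# Tight prior (gamma -> 0): the posterior mean approaches the prior mean
beta_tight, _ = posterior(X, y, mu, Sigma, sigma2=0.25, gamma2=1e-12)
```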

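The Bayesian estimate (7) also can be derived without reference to probability. Consider the constraint that the regression parameters lie within an ellipsoid,

$$
(\boldsymbol{\beta} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\beta} - \boldsymbol{\mu}) \le c, \tag{9}
$$

centered at ***μ***. With constraint (9), the solution to the *constrained least squares problem* can be obtained by the method of Lagrange multipliers by introducing the objective function

$$
L = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda (\boldsymbol{\beta} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\beta} - \boldsymbol{\mu}), \tag{10}
$$

where *λ* is a Lagrange multiplier to be determined. Solving for the stationary solution ∂*L*/∂***β*** = **0** gives the regression estimate (7), provided that *λ* = *σ*^{2}/*γ*^{2}. The objective function *L* can be interpreted as the sum square error (4) plus a “penalty term” that grows as the regression parameters deviate from the center of the ellipsoid (9). The parameter *λ* measures the weight of the penalty term. From the Bayesian view, *λ* measures the ratio of the uncertainty of a single prediction to the uncertainty in the regression parameters. The parameters *λ* and *c* in (9) and (10) are related, but the precise relation is immaterial since in practice these parameters are adjustable.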

The objective function (10) is −2 log *p*(***β***|***y***), aside from a positive scale factor *σ*^{2} and an irrelevant additive constant. Thus, the minimizer of *L* maximizes the posterior density and hence is the most probable set of coefficients. This fact shows that the constrained least squares problem follows naturally from Bayes theorem: one simply maximizes the posterior density derived from the Bayes theorem. Either the Bayesian estimate, the constrained least squares solution, or the maximum posterior solution can serve as a starting point for regression estimates. These equivalences prove useful for developing an intuitive understanding of the difference between the estimation methods.

The Bayesian estimate also can be interpreted as a form of *forecast assimilation*, in the sense of Stephenson et al. (2005). This connection can be seen more suggestively by invoking the matrix lemma in Lindley and Smith (1972) to rewrite (7) and (8) as

$$
E[\boldsymbol{\beta} \mid \mathbf{y}] = \boldsymbol{\mu} + \mathbf{K} (\mathbf{y} - \mathbf{X}\boldsymbol{\mu}), \qquad \boldsymbol{\Phi} = \gamma^2 (\mathbf{I} - \mathbf{K}\mathbf{X}) \boldsymbol{\Sigma}, \tag{11}
$$

where 𝗞 is the associated *Kalman gain matrix*:

$$
\mathbf{K} = \boldsymbol{\Sigma} \mathbf{X}^T \left( \mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^T + \lambda \mathbf{I} \right)^{-1}. \tag{12}
$$

This formulation can be interpreted as “updating” the prior ***μ*** based on data and “background” error covariance **Σ**. The limit *λ* → ∞ gives 𝗞 = **0**, which recovers the prior. The limit *λ* → 0, or more precisely **Σ** → ∞𝗜, can be shown to recover ordinary least squares (the proof is facilitated by invoking the matrix lemma mentioned above). Stephenson et al. (2005) suggest that the above equations provide the basis of a forecast assimilation procedure that converts predictions into calibrated forecasts, just as data assimilation converts observations into model fields. In most multimodel applications, (7) is preferable to (11) since the matrix to be inverted is of smaller order.
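The equivalence of the two forms of the estimate, and the difference in the size of the matrix to be inverted, can be seen in a short numerical sketch. The data, the diagonal **Σ**, and the value of *λ* below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 22, 6                                   # illustrative sizes (assumed)
X = rng.standard_normal((N, K))
y = rng.standard_normal(N)
mu = np.full(K, 1.0 / K)                       # hypothetical prior mean
Sigma = np.diag(rng.uniform(0.5, 2.0, K))      # diagonal prior structure (assumed)
lam = 1.5                                      # lambda = sigma^2 / gamma^2

# Penalized least squares form: solve a K x K system
Sigma_inv = np.linalg.inv(Sigma)
beta_small = np.linalg.solve(X.T @ X + lam * Sigma_inv,
                             X.T @ y + lam * Sigma_inv @ mu)

# Kalman gain form: invert an N x N matrix, then update the prior mean
gain = Sigma @ X.T @ np.linalg.inv(X @ Sigma @ X.T + lam * np.eye(N))
beta_gain = mu + gain @ (y - X @ mu)
```

With *K* = 6 models and *N* = 22 samples, the first form inverts a 6 × 6 matrix and the second a 22 × 22 matrix, which is why the first is usually the cheaper choice in multimodel applications.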

Experience suggests that it is advantageous to *standardize* predictors—that is, to center them and scale them to unit variance. We confirmed that, for our data, multimodels with standardized variables have higher skill than those based on unstandardized variables. Hereafter, we present results based only on standardized predictors. We normalize predictors so that their sum square equals unity, that is, 𝗫^{T}𝗫 is a correlation matrix. This normalization implies that the value *λ* = 1 corresponds to equal weighting on the constraint and error measures when **Σ** = 𝗜.

Finally, prior to data analysis, it is reasonable to assume that knowledge of a regression parameter for any one model tells us nothing about the parameter for any other model. This assumption effectively implies that the parameters are independent, that is, that **Σ** is diagonal. We shall make this assumption in the remainder of the paper.

### a. Ridge regression (R:0)

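A common prior belief is that the regression parameters should not be large. This belief can be expressed as the constraint

$$
\boldsymbol{\beta}^T \boldsymbol{\beta} \le c, \tag{13}
$$

where *c* is a positive number. Geometrically, (13) defines a sphere in ***β*** space centered at the origin. Spherical symmetry is appropriate because the predictors have equal variances and come from “equally respectable institutions.” The corresponding constrained least squares problem can be solved by the method of Lagrange multipliers by introducing the objective function

$$
L = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T \boldsymbol{\beta} \tag{14}
$$

and solving for the stationary solution ∂*L*/∂***β*** = **0**, which gives the regression estimate

$$
\boldsymbol{\beta}_{R:0} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}. \tag{15}
$$

Equivalently, we could express the prior belief that the regression parameters are not large by assuming a prior distribution for ***β*** with zero mean and covariance matrix *γ*^{2}𝗜, where the parameter *γ* measures “large.” The resulting estimate inferred from (7) is (15).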

The estimate (15) is known as *ridge regression*. The parameter *λ* is called the *ridge parameter*. Ridge regression can give reasonable estimates even if the matrix 𝗫^{T}𝗫 is ill conditioned. To see this, consider SVD of the design matrix 𝗫:

$$
\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T, \tag{16}
$$

where 𝗨 is an *N* × *K* matrix such that 𝗨^{T}𝗨 = 𝗜, 𝗩 is a *K* × *K* unitary matrix such that 𝗩^{T}𝗩 = 𝗜, and 𝗦 is a real diagonal *K* × *K* matrix with nonnegative diagonal elements *s*_{1} ≥ *s*_{2} ≥ . . . ≥ *s*_{K}. Substituting the SVD of 𝗫 into the ridge regression estimate (15) gives

$$
\boldsymbol{\beta}_{R:0} = \mathbf{V} \mathbf{D} \mathbf{U}^T \mathbf{y}, \tag{17}
$$

where 𝗗 is a diagonal matrix whose *i*th diagonal element is given by

$$
d_i = \frac{s_i}{s_i^2 + \lambda}. \tag{18}
$$

In ordinary least squares with *λ* = 0, some diagonal elements of 𝗗 become unbounded as a singular value tends toward zero, leading to very large (and presumably unrealistic) values of ***β***. A nonzero value of *λ* prevents the diagonal elements from becoming unbounded and in fact makes them tend toward zero as the singular value tends toward zero. If the first few singular values are well separated from the others, then an intermediate value of *λ* has virtually no impact on the leading singular vectors, recovering the least squares solution for those components, but damping the remaining components. This solution is a “smoother” version of the truncation method of Yun et al. (2003), in which all but the largest singular values are “zeroed out.”
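The damping effect of the ridge parameter on small singular values can be illustrated with a deliberately ill-conditioned design matrix. The following sketch uses synthetic data in which two predictors are nearly collinear (an assumption made only to force a small singular value) and compares the ridge factors *s*/(*s*^{2} + *λ*) with their least squares counterparts 1/*s*.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 22, 6                                         # illustrative sizes (assumed)
X = rng.standard_normal((N, K))
X[:, -1] = X[:, 0] + 1e-4 * rng.standard_normal(N)   # nearly collinear pair
y = rng.standard_normal(N)
lam = 0.1                                            # illustrative ridge parameter

# Thin SVD: X = U diag(s) V^T with s sorted in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d_ridge = s / (s**2 + lam)   # ridge factors, bounded by 1 / (2 sqrt(lam))
d_ls = 1.0 / s               # least squares factors (lam = 0), unbounded as s -> 0

# Ridge estimate assembled from the SVD, and directly from the normal equations
beta_svd = Vt.T @ (d_ridge * (U.T @ y))
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
```

The last element of `d_ls` is very large because of the near collinearity, whereas the corresponding element of `d_ridge` is small, which is the behavior described in the text.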

For (15), the limit *λ* → ∞ corresponds to the “climatological forecast.” This follows from the fact that *λ* → ∞ implies that ***β*** = **0**, which in turn implies that the predicted ***y*** is zero, which corresponds to the climatological mean since predictands are centered. Conversely, the limit *λ* → 0 corresponds to ordinary least squares. Hence, intermediate values of *λ* give solutions that are “in between” ordinary least squares and the climatological forecast, with the degree of mixing controlled by *λ*.

Van den Dool and Rukhovets (1994) have proposed ridge regression for constructing multimodel regressions. The connection between ridge regression, Bayesian theory, and constrained least squares has been noted by several authors (Hoerl and Kennard 1970; Lindley and Smith 1972; Goldstein and Smith 1974; Draper and Smith 1998, to name a few).

### b. Ridge regression with multimodel mean constraint (R:MM)

The simple multimodel mean corresponds to the constraint that every regression parameter equals 1/*K*. This constraint can be relaxed to the statement that the regression coefficients for centered variables should lie within the spherical boundary centered at **1**/*K*:

$$
\left( \boldsymbol{\beta} - \frac{\mathbf{1}}{K} \right)^T \left( \boldsymbol{\beta} - \frac{\mathbf{1}}{K} \right) \le c, \tag{19}
$$

where **1** = [1 1 . . . 1]^{T} and *c* is a constant [not related to *c* in (13)]. The case *c* = 0 corresponds to a multimodel mean. The corresponding constrained least squares problem is solved by the method of Lagrange multipliers with objective function

$$
L = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \left( \boldsymbol{\beta} - \frac{\mathbf{1}}{K} \right)^T \left( \boldsymbol{\beta} - \frac{\mathbf{1}}{K} \right). \tag{20}
$$

This function can be interpreted as a sum of the familiar sum square error and a penalty function that grows as the regression parameters deviate from the multimodel mean solution. The regression parameters minimizing this objective function obtained from ∂*L*/∂***β*** = **0** are

$$
\boldsymbol{\beta}_{R:MM} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \left( \mathbf{X}^T \mathbf{y} + \lambda \frac{\mathbf{1}}{K} \right). \tag{21}
$$

Note that this estimate differs from the ridge regression estimate (15) by an additive term, arising from the fact that the prior (19) was not centered at the origin. As a consistency check, setting *λ* = 0 in (21) recovers the least squares solution (5), while in the limit *λ* → ∞ the solution (21) approaches ***β***_{R:MM} = **1**/*K* (as opposed to **0** for the case of ridge regression). It follows that intermediate values of *λ* give a mix of these two limits. The above solution also is recovered from a Bayesian estimate in which the prior distribution for ***β*** has mean and covariance matrix

$$
\boldsymbol{\mu} = \frac{\mathbf{1}}{K}, \qquad \gamma^2 \boldsymbol{\Sigma} = \gamma^2 \mathbf{I}. \tag{22}
$$

In essence, this prior distribution assumes that, prior to data analysis, the regression parameters are expected to be centered about the multimodel mean with an uncertainty that scales with *γ*.

### c. Ridge regression with scaled multimodel mean (R:MM+R)

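A further generalization is a scaled multimodel mean, in which all models receive the same weight but that weight need not equal 1/*K*. Mathematically, the model contains one scalar *β̃* such that

$$
\mathbf{y} = \tilde{\beta}\, \mathbf{X} \mathbf{1} + \mathbf{w}. \tag{23}
$$

The least squares solution for this problem can be computed from (5) by substituting 𝗫**1** for 𝗫:

$$
\tilde{\beta} = \left( \mathbf{1}^T \mathbf{X}^T \mathbf{X} \mathbf{1} \right)^{-1} \mathbf{1}^T \mathbf{X}^T \mathbf{y}. \tag{24}
$$

In this paper, we relax the above constraint to the statement that the regression parameters lie within a spherical domain centered about an unknown point *δ***1**, where *δ* is chosen to minimize the mean square error. The objective function for this constrained least squares problem is

$$
L = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda (\boldsymbol{\beta} - \delta \mathbf{1})^T (\boldsymbol{\beta} - \delta \mathbf{1}). \tag{25}
$$

Again, the first term on the right is the familiar sum square error while the second term is a penalty function that grows as the regression parameters deviate from the value *δ*. Taking the derivative of the objective function with respect to ***β*** and *δ*, and setting the result to zero, gives the solution

$$
\boldsymbol{\beta}_{R:MM+R} = \left[ \mathbf{X}^T \mathbf{X} + \lambda \left( \mathbf{I} - \frac{\mathbf{1}\mathbf{1}^T}{K} \right) \right]^{-1} \mathbf{X}^T \mathbf{y}, \qquad \delta = \frac{\mathbf{1}^T \boldsymbol{\beta}_{R:MM+R}}{K}. \tag{26}
$$

Again, as *λ* → 0 this solution reduces to the least squares solution. The appendix shows that in the limit *λ* → ∞ the regression estimate approaches *β̃***1**, where *β̃* is given in (24).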
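The limiting behavior of the two multimodel-mean estimates can be checked numerically. The sketch below assumes synthetic data and uses hypothetical function names `r_mm` and `r_mm_r` for the estimate penalized toward the weights 1/*K* and the estimate penalized toward an unknown common weight, respectively.

```python
import numpy as np

def r_mm(X, y, lam):
    """Ridge estimate with penalty centered on the multimodel mean weights 1/K."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K),
                           X.T @ y + lam * np.ones(K) / K)

def r_mm_r(X, y, lam):
    """Ridge estimate with penalty centered on an unknown common weight delta."""
    K = X.shape[1]
    centering = np.eye(K) - np.ones((K, K)) / K   # removes the mean of the weights
    return np.linalg.solve(X.T @ X + lam * centering, X.T @ y)

rng = np.random.default_rng(4)
N, K = 22, 6                                       # illustrative sizes (assumed)
X = rng.standard_normal((N, K))
y = X @ np.full(K, 0.1) + 0.5 * rng.standard_normal(N)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar weight of the scaled multimodel mean: regress y on the sum of predictors
x_sum = X @ np.ones(K)
beta_tilde = (x_sum @ y) / (x_sum @ x_sum)

beta_mm_large = r_mm(X, y, 1e10)       # approaches 1/K for every model
beta_mmr_large = r_mm_r(X, y, 1e10)    # approaches beta_tilde for every model
```

The centering matrix penalizes only the spread of the weights about their own mean, so the common amplitude is left to be determined by the data.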

Interestingly, the above solution can be obtained from a *hierarchical Bayesian theory* by introducing a higher-order prior and associated *hyperparameters*. Specifically, we assume as before that the random error term ***w*** in (1) is normally distributed with zero mean and covariance matrix *σ*^{2}𝗜, but that the regression parameter ***β*** has a normal prior distribution with mean *δ***1** and covariance matrix *γ*^{2}𝗜, and that the scalar *δ* has a univariate normal distribution with mean *ω* and variance *ξ*^{2}. The solution to this three-level hierarchical Bayesian problem is given in Lindley and Smith [1972, their (15)]. In the limit *ξ* → ∞, corresponding to a vague prior for the hyperparameters, the mean of the posterior density for the regression parameters is (26). The fact that the third-order hierarchical Bayesian solution can be interpreted as the solution to a suitable constrained least squares problem does not appear to have been previously recognized.

### d. Regression with signal-to-noise information (R:S2N)

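Information about the signal-to-noise ratio of each model can be incorporated through the objective function

$$
L = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T \boldsymbol{\Theta}^{-1} \boldsymbol{\beta}, \tag{27}
$$

where **Θ** is a diagonal matrix whose diagonal elements are signal-to-noise ratios. Note that the predictors are standardized, so signals have unit variance and the diagonal elements of **Θ**^{−1} are normalized noise variances. The stationary point of this objective function is obtained when the regression parameter takes the value

$$
\boldsymbol{\beta}_{R:S2N} = \left( \mathbf{X}^T \mathbf{X} + \lambda \boldsymbol{\Theta}^{-1} \right)^{-1} \mathbf{X}^T \mathbf{y}. \tag{28}
$$

Hoerl and Kennard (1970) call this estimate a general form of ridge regression. Equivalently, the solution (28) can be derived from Bayesian analysis in which the regression parameters ***β*** have a prior distribution with zero mean and covariance matrix **Θ**.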

The above model does not assume a monotonic spread–skill relation. Rather, R:S2N simply damps predictors with small signal-to-noise ratio, a defensible procedure. Predictors with large signal-to-noise ratios also may be damped, though, if they do not add predictive skill.

All regression parameter estimates considered in this paper are summarized in Table 1.

### e. Choice of ridge parameter

A critical question in the above methods is how to choose the ridge parameter *λ*. Several methods for selecting the ridge parameter are reviewed in Draper and Van Nostrand (1979), Golub et al. (1979), and Smith and Campbell (1980). A fully Bayesian approach would assign prior distributions to the covariance matrices *σ*^{2}𝗜 and *γ*^{2}**Σ**. Unfortunately, this approach is computationally burdensome. Moreover, our goal is to make contact with other approaches that have been applied in the climate literature. Accordingly, instead of pursuing a purely Bayesian approach, we use Bayesian theory to derive all but a single parameter, and then apply selection techniques to choose the remaining parameter, as is usually done in ridge regression.

In this paper, we use cross validation as a basis for selecting the ridge parameter. A lucid review of cross validation can be found in Stone (1974) [see also Michaelsen (1987) for a review in a climate context]. Since we need to select a ridge parameter *and* estimate prediction error, ordinary cross validation is not adequate. We employ Stone’s “cross-validatory assessment of cross-validatory choice,” which, following a remark in his paper, we call *two-deep cross validation*. [This procedure appears to be called “double cross validation” in some papers, though this usage differs from that of Stone (1974). To avoid confusion, we use the above term.] The idea is to apply cross validation recursively in two stages. In the outer stage, we set aside one sample, construct a model based on the remaining *N* − 1 samples (using an inner stage described below), then test the resulting model on the set-aside sample. Repeating this procedure for all possible set-aside samples yields *N* forecast/verification pairs from which a measure of skill can be estimated. In the inner stage, ordinary cross validation is performed on the *N* − 1 samples left over by removing a sample in the outer stage, which yields a measure of prediction error, given *λ*. We then vary *λ* to find the value that minimizes the sum square error in the *N* − 1 samples, completing the inner stage. The final skill pertains to models with differing values of *λ*. This variability is not unlike a real forecast scheme that is updated with each passing forecast, and then assessed for overall skill at a later time.

To state the procedure precisely, let ***β̂***(*j*, *λ*) be an estimate derived from all samples excluding the *j*th sample, and let ***β̂***(*j*, *k*, *λ*) be an estimate derived from all samples excluding the *j*th and *k*th sample. Let *λ*_{0}(*k*) be the value of *λ* that minimizes the sum square error in the whole sample excluding the *k*th sample:

$$
\lambda_0(k) = \arg\min_{\lambda} \sum_{j \ne k} \left[ y_j - X_{ja}\, \hat{\beta}_a(j, k, \lambda) \right]^2, \tag{29}
$$

where a summation over *a* is understood. Then, the prediction for the *k*th sample is

$$
\hat{y}_k = X_{ka}\, \hat{\beta}_a\big(k, \lambda_0(k)\big). \tag{30}
$$

In practice, *λ*_{0}(*k*) is determined at each grid box by evaluating (29) for a sequence of *λ*s starting at 0 and increasing by 0.1 until the value of 5 is reached, and then choosing the *λ* with the smallest sum square residual (29). This procedure is applied at each grid box independently, so the selected ridge parameter *λ* generally varies with grid box and with year *k*.
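The two stages of the procedure can be sketched in brute-force form as follows. This is a minimal illustration on synthetic data at a single hypothetical grid box: the function names are invented for the example, the R:MM estimate is used as the regression, and the predictors are not restandardized within each subsample, which a full implementation might do.

```python
import numpy as np

def ridge_mm(X, y, lam):
    """R:MM estimate, used here as the example regression."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K),
                           X.T @ y + lam * np.ones(K) / K)

def inner_sse(X, y, lam):
    """Inner stage: ordinary leave-one-out sum square error for a fixed lambda."""
    sse = 0.0
    for j in range(len(y)):
        keep = np.arange(len(y)) != j
        sse += (y[j] - X[j] @ ridge_mm(X[keep], y[keep], lam)) ** 2
    return sse

def two_deep_cv(X, y, lams):
    """Outer stage: predict each set-aside sample with its own selected lambda."""
    N = len(y)
    y_hat, lam_0 = np.empty(N), np.empty(N)
    for k in range(N):
        keep = np.arange(N) != k                 # set aside sample k
        Xk, yk = X[keep], y[keep]
        errors = [inner_sse(Xk, yk, lam) for lam in lams]
        lam_0[k] = lams[int(np.argmin(errors))]  # lambda minimizing the inner error
        y_hat[k] = X[k] @ ridge_mm(Xk, yk, lam_0[k])
    return y_hat, lam_0

rng = np.random.default_rng(5)
N, K = 22, 6                                     # illustrative sizes (assumed)
X = rng.standard_normal((N, K))
y = X @ np.full(K, 1.0 / K) + rng.standard_normal(N)

lams = np.linspace(0.0, 5.0, 51)                 # 0, 0.1, ..., 5 as in the text
y_hat, lam_0 = two_deep_cv(X, y, lams)
skill = np.corrcoef(y_hat, y)[0, 1]              # correlation of the N forecast/verification pairs
```

Because the set-aside sample never enters the selection of *λ*, the resulting skill estimate is not contaminated by the selection step; the cost is roughly *N*^{2} regression fits per value of *λ*.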

The above procedure requires on the order of *N*^{2} regression estimates, which is a substantial computational burden. Remarkably, the sum square residuals (29) can be computed without explicitly computing ***β̂***(*j*, *k*, *λ*) for each *k*. Adapting the equation for cross-validated error in Stone [1974, his (3.13)], it can be shown that (29) can be calculated equivalently as

$$
\sum_{j \ne k} \left[ \frac{ y_j - X_{ja}\, \hat{\beta}_a(k, \lambda) }{ 1 - \sigma^{-2} X_{ja} (\boldsymbol{\Phi}_k)_{ab} X_{jb} } \right]^2, \tag{31}
$$

where **Φ**_{k} is the **Φ** defined in (8) excluding the *k*th sample. This equation turns out to be valid for all parameter estimates discussed in the paper, with ***β̂*** and **Φ** identified by (7) and (8). The use of (31) instead of (29) results in more than an order of magnitude of computational savings.
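The shortcut for the leave-one-out error can be verified against brute-force refitting. The sketch below assumes synthetic data, **Σ** = 𝗜, and the R:MM prior mean; it relies on the standard leave-one-out identity in which each full-sample residual is divided by one minus the corresponding leverage.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K, lam = 21, 6, 0.7             # N - 1 = 21 inner-stage samples (assumed sizes)
X = rng.standard_normal((N, K))
y = rng.standard_normal(N)
mu = np.full(K, 1.0 / K)           # prior mean of R:MM, with Sigma = I

def estimate(X, y):
    """Penalized estimate with prior mean mu and ridge parameter lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y + lam * mu)

# Brute force: refit the regression with each sample j removed
brute = 0.0
for j in range(N):
    keep = np.arange(N) != j
    brute += (y[j] - X[j] @ estimate(X[keep], y[keep])) ** 2

# Shortcut: a single fit, with residuals inflated by 1 / (1 - leverage)
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(K))    # posterior covariance divided by sigma^2
leverage = np.einsum("ja,ab,jb->j", X, A_inv, X)    # x_j^T A^{-1} x_j for each sample
residual = y - X @ estimate(X, y)
fast = np.sum((residual / (1.0 - leverage)) ** 2)
```

The brute-force loop performs *N* fits while the shortcut performs one, which is the source of the computational savings.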

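An alternative to ordinary cross validation in the inner stage is *generalized cross validation*, as described by Golub et al. (1979) and Hansen (1998). The estimated prediction error based on generalized cross validation is

$$
\frac{ (N-1)^{-1} \sum_{j \ne k} \left[ y_j - X_{ja}\, \hat{\beta}_a(k, \lambda) \right]^2 }{ \left[ 1 - (N-1)^{-1} \sigma^{-2}\, \mathrm{tr}\left( \mathbf{X}_k \boldsymbol{\Phi}_k \mathbf{X}_k^T \right) \right]^2 }, \tag{32}
$$

where 𝗫_{k} denotes the design matrix excluding the *k*th sample. Again, the criterion is to choose the ridge parameter *λ* that minimizes (32) in the inner stage.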
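Selection of the ridge parameter by generalized cross validation can be sketched as follows. The example assumes synthetic data, **Σ** = 𝗜, and the R:MM prior mean, and it applies the usual generalized cross-validation score of Golub et al. (1979) to the residuals of the penalized fit; the function name `gcv_score` is hypothetical.

```python
import numpy as np

def gcv_score(X, y, lam, mu):
    """Generalized cross-validation estimate of prediction error with Sigma = I."""
    n, K = X.shape
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(K))
    beta = A_inv @ (X.T @ y + lam * mu)
    trace_h = np.trace(X @ A_inv @ X.T)          # effective number of parameters
    mean_sq_resid = np.sum((y - X @ beta) ** 2) / n
    return mean_sq_resid / (1.0 - trace_h / n) ** 2

rng = np.random.default_rng(7)
n, K = 21, 6                                     # inner-stage sample size (assumed)
X = rng.standard_normal((n, K))
y = X @ np.full(K, 1.0 / K) + rng.standard_normal(n)
mu = np.full(K, 1.0 / K)

lams = np.linspace(0.0, 5.0, 51)                 # same grid as ordinary cross validation
scores = np.array([gcv_score(X, y, lam, mu) for lam in lams])
lam_gcv = lams[np.argmin(scores)]                # selected ridge parameter
```

Unlike ordinary cross validation, the score replaces the individual leverages with their average, so only the trace of the hat matrix is needed.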

## 4. Data

The data used in this paper are the hindcast integrations of the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) project. This dataset, reviewed in Palmer et al. (2004), consists of 6-month hindcasts by seven global coupled ocean–atmosphere models. The individual models are listed in Table 2. Each hindcast model produced a nine-member ensemble hindcast, all of which were averaged to construct ensemble mean hindcasts. Only hindcasts starting in the years 1980–2001 will be discussed in this paper, since all seven coupled model hindcasts were available for this period. In the notation of section 3, this implies that *N* = 22. The DEMETER hindcasts were initialized at 1 February, 1 May, 1 August, and 1 November and were integrated for the subsequent 6 months.

The variable considered in this paper is 2-m surface temperature over land. The predictability of this variable, when the sea surface temperature (SST) is known, has been examined comprehensively by Barnston and Smith (1996) and DelSole and Shukla (2006). These studies have shown that, depending on the region and season, temperature can be predicted with statistically significant skill even after four months. The present paper extends this work to consider the predictability of the coupled atmosphere–ocean system, when the SST is unknown.

Only 3-month averages are considered in this study. Three-month periods are denoted by the first letter of each respective month [e.g., January–March (JFM)]. The 3-month averages were computed from the monthly dataset available from the online data retrieval system at the European Centre for Medium-Range Weather Forecasts (ECMWF). Each 3-month period will be referred to as a “season.” If some monthly mean values are missing, the seasonal mean is computed using the remaining months. If all three months in a season are missing, then the corresponding grid box is dropped from the analysis. Although stringent, this criterion leaves the vast majority of grid boxes available for analysis.

The observational land surface temperature record used for verifying the hindcasts is the 5° × 5° gridded “HadCRUT2” dataset compiled jointly by the Climatic Research Unit (CRU) and the Met Office’s Hadley Centre (Jones and Moberg 2003; available from http://www.cru.uea.ac.uk/cru/data/temperature/).

All data were interpolated onto the 5° × 5° HadCRUT2 grid to facilitate comparison on a common grid. Although this grid is relatively coarse, it is not an unreasonable choice since seasonal hindcast models are not expected to accurately predict relatively small regions.

Hindcasts by the Max Planck Institute (MPI) model were found to be significantly biased and hence were not included as predictors. Nevertheless, the results are essentially the same if the MPI model is included.

## 5. Skill assessment

Following Barnston and Smith (1996), the measure of skill used in this study is the correlation coefficient between the hindcast and observation, at each grid box and at each season, during the 22-yr period 1980–2001. This metric is not affected by linear transformations of the forecast, which are often employed to correct for systematic biases or height differences between model and observation. For reference, the 95% and 99% critical levels for statistically significant correlation for 22 independent, normally distributed samples are 0.36 and 0.49, respectively.
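The quoted critical levels are consistent with a one-sided Student's *t* test on the correlation coefficient with *N* − 2 = 20 degrees of freedom. This reading is an inference from the numbers, not a statement of how the thresholds were computed. Under that assumption, the critical correlation *r*_{c} follows from the critical value *t*_{c} as

$$
r_c = \frac{t_c}{\sqrt{t_c^2 + N - 2}}, \qquad
t_{0.95,\,20} \approx 1.725 \;\Rightarrow\; r_c \approx \frac{1.725}{\sqrt{2.976 + 20}} \approx 0.36, \qquad
t_{0.99,\,20} \approx 2.528 \;\Rightarrow\; r_c \approx \frac{2.528}{\sqrt{6.391 + 20}} \approx 0.49 .
$$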

To characterize the skill at several grid boxes, we follow Barnston and Smith (1996) and use the area-averaged correlation skill of selected regions around the globe. DelSole and Shukla (2006) compared this metric to several others, such as mean square error and localized mutual information, and found that these metrics often gave similar relative rankings. We consider this metric preferable to other metrics such as the pattern anomaly correlation, which is difficult to average sensibly over different verification periods, or the skill of large-scale spatial averaged fields, which fail to capture dipoles and other anticorrelated variability. The specific regions are summarized in Table 3 and illustrated in Fig. 1. The regions over land differ slightly from Barnston and Smith (1996) in that the domains used here do not overlap.

The statistical significance of spatially averaged correlation coefficients was determined by a bootstrap method (Efron and Tibshirani 1993) as follows. For a given season, say JJA, there exist 22 “maps” of the JJA mean field, one for each year in the 22-yr record. These maps were randomly selected (with replacement) to construct two 22-yr sequences of maps. Then, the point-by-point correlation was computed, treating one sequence as “verification” and the other as “forecast.” This procedure was repeated 10 000 times to compute an empirical distribution for the spatially averaged correlation skill for a fixed season, from which the 1% significance level was estimated. In most cases, the 1% significance level was around 0.22.
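The bootstrap procedure can be sketched as follows. The example is an illustration only: the maps are synthetic white noise rather than observed temperature fields, the area weights are taken as equal, the function name is hypothetical, and fewer resamples are used in the demonstration call than the 10 000 described in the text.

```python
import numpy as np

def bootstrap_threshold(maps, weights, n_boot=10_000, level=0.99, seed=0):
    """Significance threshold for area-averaged correlation from resampled maps.

    maps    : (n_years, n_boxes) array of seasonal means for one season
    weights : (n_boxes,) area weights summing to one
    """
    rng = np.random.default_rng(seed)
    n_years = maps.shape[0]
    skills = np.empty(n_boot)
    for b in range(n_boot):
        # Two independent sequences of years, each drawn with replacement
        fcst = maps[rng.integers(0, n_years, n_years)]
        verif = maps[rng.integers(0, n_years, n_years)]
        f = fcst - fcst.mean(axis=0)
        v = verif - verif.mean(axis=0)
        # Point-by-point correlation between the two sequences
        corr = (f * v).sum(axis=0) / np.sqrt((f**2).sum(axis=0) * (v**2).sum(axis=0))
        skills[b] = weights @ corr               # area-averaged correlation
    return np.quantile(skills, level)            # 1% significance level for level=0.99

rng = np.random.default_rng(8)
maps = rng.standard_normal((22, 40))             # synthetic stand-in for 22 seasonal maps
weights = np.full(40, 1.0 / 40)                  # equal area weights (assumed)
threshold = bootstrap_threshold(maps, weights, n_boot=2000)
```

Resampling whole maps rather than individual grid boxes preserves the spatial correlation of the field, which is what determines the spread of the area-averaged statistic.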

Many grid boxes turn out to have negative correlation skill. One might guess that a reasonable indicator of whether a regression model will have positive skill is whether the square error (31) is less than the error of a forecast based on the climatological mean. This assumption turns out to be incorrect, because (31) is an overly optimistic estimate of skill when it also is used to determine the optimal ridge parameter. Attempts to compensate for this bias by defining more stringent thresholds proved ineffective. In the results below, spatial maps will hide regions with statistically insignificant or negative skill, but the area average of these maps will include both positive and negative correlations, since in practice both signs would be present.

## 6. Results

The correlation skills of JAS 2-m temperature for the multimodel mean (MMM), LS, and R:MM for the period 1980–2001 are shown in Fig. 2. The skill is highest in eastern Brazil, Central America, Chile, western North America, the Mediterranean, North Asia, and coastal Australia, and the skill is lowest in central South America, Africa, and Australia. These features tend to be preserved for other initial conditions and lead times, though the amplitude of the correlations generally decreases with lead time.

The spatially averaged correlation skills of model hindcasts over the land areas defined in Table 3 are shown in Fig. 3. Also shown are the skills of the MMM and LS. The skills of individual models are not of interest here and hence are shown with indistinguishable line types. The figure reveals the familiar tendency for skill to decrease with lead time (recall that the initial conditions are states in February, May, August, and November). Also evident is the fact that no single model is superior to the others in all regions in all lead times. Hindcasts over the tropical land (TRP) and South America (SAM) tend to have larger skill than that over other land areas, especially at long lead times. In most cases, the skill of the multimodel mean tends to be at the top, while that of ordinary least squares tends to be at the bottom, consistent with previous studies (Palmer et al. 2004; Doblas-Reyes et al. 2005).

The cross-validated mean-square error as a function of ridge parameter is illustrated in Fig. 4. The left panel shows an example in which a large ridge parameter, and hence a simple multimodel mean, gives the best hindcast. A pure least squares fit (*λ* = 0) in this case produces a hindcast that has normalized mean square error exceeding unity and hence is worse than climatology. The middle panel shows an example in which a small ridge parameter, and hence pure least squares, gives the best hindcast. In this case, a large ridge parameter (i.e., the multimodel mean) yields hindcasts that also have skill, but not optimal skill. These results reveal that even within the same forecast season, the multimodel mean performs better than ordinary least squares at some grid boxes, while the reverse is true at other grid boxes. The right panel shows an example in which an intermediate value of the ridge parameter gives the best hindcast.
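The dependence of cross-validated error on the ridge parameter can be sketched in a few lines of Python. The example below is a minimal illustration on synthetic data, not on the DEMETER hindcasts. The function names `rmm_weights` and `loo_cv_nmse`, the grid of ridge parameters, and the unscaled form of the penalty are assumptions made for the sketch; the normalization of *λ* used in the paper may differ.

```python
import numpy as np

def rmm_weights(X, y, lam):
    """Ridge weights shrunk toward the multimodel mean (each weight -> 1/K)."""
    K = X.shape[1]
    target = np.full(K, 1.0 / K)
    A = X.T @ X + lam * np.eye(K)
    return np.linalg.solve(A, X.T @ y + lam * target)

def loo_cv_nmse(X, y, lam):
    """Leave-one-out squared error, normalized by that of a climatological forecast."""
    N = len(y)
    err = np.empty(N)
    clim_err = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        Xtr, ytr = X[keep], y[keep]
        # Center with training-sample means only, so year i stays independent.
        xm, ym = Xtr.mean(axis=0), ytr.mean()
        w = rmm_weights(Xtr - xm, ytr - ym, lam)
        pred = ym + (X[i] - xm) @ w
        err[i] = (y[i] - pred) ** 2
        clim_err[i] = (y[i] - ym) ** 2
    return err.sum() / clim_err.sum()

# Synthetic stand-in for one grid box: 22 years, 7 models (hypothetical data).
rng = np.random.default_rng(0)
N, K = 22, 7
signal = rng.standard_normal(N)
X = signal[:, None] + rng.standard_normal((N, K))   # synthetic forecasts
y = signal + 0.5 * rng.standard_normal(N)           # synthetic observations

lams = np.array([0.0, 0.1, 1.0, 10.0, 100.0, 1e6])
curve = np.array([loo_cv_nmse(X, y, lam) for lam in lams])
best = lams[np.argmin(curve)]   # ridge parameter with the smallest CV error
```

A value of `curve` above 1 means the hindcast is worse than climatology. At `lam = 0` the weights are those of ordinary least squares, and for very large `lam` they approach 1/*K*, so the two ends of the curve correspond to the two limiting cases discussed above.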

The area-averaged correlation skills of different multimodel hindcasts are shown in Fig. 5. We see immediately that none of the hindcasts perform better than the simple MMM. Surprisingly, even the R:MM hindcast, which reduces to the multimodel mean for large ridge parameter, fails to achieve the same level of skill as the MMM. This discrepancy suggests that our method of selecting the ridge parameter is flawed. Accordingly, we repeated the analysis but used *generalized cross validation* in the inner stage of two-deep cross validation. While most of the resulting hindcasts were slightly better (not shown), none were as good as MMM.

To investigate the above problem further, we identified grid boxes in which MMM performed better than R:MM and examined the corresponding ridge parameter selected by our criterion. In most cases, the ridge parameter varied substantially from year to year. An extreme example is shown in Fig. 6. The top panel reveals that in this case the selected ridge parameter fluctuates randomly between small values (below 1) and large values (the maximum value is 5). Clearly, the selection criterion for the ridge parameter is sensitive to the year set aside. The bottom panel of Fig. 6 shows the normalized mean square error computed from the inner stage of two-deep cross validation for the years 1992–96 [i.e., the panel shows (31) as a function of *λ* for the five years]. The global minimum of each curve determines the ridge parameter that is selected in each year. Consistent with the top panel for the years 1992–96, two of the curves have global minima at *λ* less than 1, and the rest have minima at the boundary *λ* = 5.

Interestingly, the global minima below *λ* = 1 are within 2% of the asymptotic error at *λ* = 5. Perhaps these global minima are not statistically different from the choice *λ* → ∞, in which case it might be reasonable to reject such minima. Unfortunately, the sampling distribution of the cross-validated error (29) is difficult to compute, so a rigorous statistical test would be difficult to formulate. However, to test the reasonableness of this formulation, we impose a constant ridge parameter by selecting the *median optimal ridge parameter*. The skills of the resulting hindcasts are shown in Fig. 7. In most cases the new R:MM hindcasts exceed the skill of the MMM. Unfortunately, the median optimal ridge parameter is not a practical selection criterion because it utilizes the entire dataset, leaving no independent data for verification. Nevertheless, it provides a simple and clear demonstration that a more robust selection criterion (i.e., less sensitive to individual samples) can lead to improved skill.
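The two-deep cross validation and the median optimal ridge parameter can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used for the figures. The data are synthetic, the helper names (`rmm_fit_predict`, `loo_error`, `two_deep_cv`) are hypothetical, the inner stage uses ordinary leave-one-out cross validation, and the grid `lams` is an arbitrary choice capped at 5 to echo the maximum value mentioned above.

```python
import numpy as np

def rmm_fit_predict(Xtr, ytr, Xte, lam):
    """Fit ridge weights shrunk toward 1/K on training data, then predict Xte."""
    K = Xtr.shape[1]
    xm, ym = Xtr.mean(axis=0), ytr.mean()
    Xc, yc = Xtr - xm, ytr - ym
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(K),
                        Xc.T @ yc + lam * np.full(K, 1.0 / K))
    return ym + (Xte - xm) @ w

def loo_error(X, y, lam):
    """Ordinary leave-one-out sum of squared errors for a fixed ridge parameter."""
    idx = np.arange(len(y))
    return sum((y[i] - rmm_fit_predict(X[idx != i], y[idx != i], X[i], lam)) ** 2
               for i in idx)

def two_deep_cv(X, y, lams):
    """Outer loop withholds one year; inner loop picks lambda without seeing it."""
    N = len(y)
    idx = np.arange(N)
    chosen = np.empty(N)
    pred = np.empty(N)
    for i in idx:
        Xtr, ytr = X[idx != i], y[idx != i]
        inner = [loo_error(Xtr, ytr, lam) for lam in lams]
        chosen[i] = lams[int(np.argmin(inner))]
        pred[i] = rmm_fit_predict(Xtr, ytr, X[i], chosen[i])
    return chosen, pred

# Synthetic stand-in for one grid box (hypothetical data).
rng = np.random.default_rng(3)
N, K = 22, 7
signal = rng.standard_normal(N)
X = signal[:, None] + rng.standard_normal((N, K))
y = signal + 0.5 * rng.standard_normal(N)
lams = np.array([0.0, 0.05, 0.2, 0.5, 1.0, 2.0, 5.0])

chosen, pred = two_deep_cv(X, y, lams)   # one selected lambda per withheld year
lam_med = np.median(chosen)              # median optimal ridge parameter

# Constant-lambda hindcasts. lam_med was derived from every year, so these
# hindcasts are not independent of the verification data.
idx = np.arange(N)
pred_med = np.array([rmm_fit_predict(X[idx != i], y[idx != i], X[i], lam_med)
                     for i in idx])
skill_two_deep = np.corrcoef(pred, y)[0, 1]
skill_median = np.corrcoef(pred_med, y)[0, 1]
```

Inspecting `chosen` shows how much the selected ridge parameter depends on the withheld year. Comparing `skill_two_deep` with `skill_median` mirrors the comparison in the text, with the same caveat that the median is computed from the full sample.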

## 7. Summary

This paper reviewed a theoretical framework for introducing prior assumptions in linear regression problems in order to reduce overfitting, that is, to reduce the fitting of variability due to sampling errors. Some previous methods for reducing overfitting, such as truncated SVD analysis, ridge regression, and constrained least squares, can be interpreted as special cases in this framework, each distinguished by different prior assumptions on the regression parameters. This framework is a part of a more general Bayesian methodology in which prior beliefs are expressed in the form of prior distributions on regression parameters. For Gaussian distributions, the results can be viewed equivalently as a constrained least squares problem in which prior beliefs are expressed in the form of constraints on the regression parameters. The corresponding constrained least squares problems are equivalent to minimizing a new cost function comprising the familiar sum square error, plus a “penalty term” that increases as the regression parameters diverge from the appropriate constraint. The magnitude of the penalty term is controlled by a parameter, called the ridge parameter. In the Bayesian framework, this parameter corresponds to the ratio of the variance of the predictand given the predictors to the prior variance, so that a tighter prior implies a larger ridge parameter. This parameter can be chosen not only to give the unconstrained least squares solution or the fully constrained solution, but also, at intermediate values, a mixture of the two solutions.
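As a worked illustration of the penalized cost function, consider the simplest case, in which the penalty is the squared distance of the regression parameters from a fixed target. The symbols used here ($\mathbf{y}$ for the predictand, $\mathbf{X}$ for the predictors, $\boldsymbol{\beta}_0$ for the target, and $\sigma^2$ and $\tau^2$ for the conditional and prior variances) are notation assumed for this sketch and may differ from that of section 2.

$$
J(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda\,\|\boldsymbol{\beta} - \boldsymbol{\beta}_0\|^{2}
$$

Setting the gradient with respect to $\boldsymbol{\beta}$ to zero gives

$$
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\left(\mathbf{X}^{T}\mathbf{y} + \lambda\boldsymbol{\beta}_0\right),
$$

which reduces to ordinary least squares for $\lambda = 0$ and tends to $\boldsymbol{\beta}_0$ as $\lambda \to \infty$. The same estimate is the posterior mean under the Gaussian model

$$
\mathbf{y}\,|\,\boldsymbol{\beta} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^{2}\mathbf{I}), \qquad \boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \tau^{2}\mathbf{I}),
$$

because the negative log posterior, multiplied by $2\sigma^{2}$, equals $J(\boldsymbol{\beta})$ plus a constant when

$$
\lambda = \frac{\sigma^{2}}{\tau^{2}}.
$$

A small prior variance $\tau^{2}$ therefore corresponds to a large ridge parameter and a solution close to the constraint.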

The following prior beliefs were explored: 1) regression parameters are “close” to zero, denoted *R*:0; 2) regression parameters are close to the multimodel mean, that is, each parameter is close to 1/*K* where *K* is the number of forecasts, denoted *R*:MM; 3) regression parameters are close to a single value consistent with ordinary least squares, denoted *R*:MM+*R*; 4) regression parameters are damped to zero inversely proportional to the signal-to-noise ratio of the corresponding forecast, denoted *R*:S2N. The estimate *R*:0 is equivalent to ridge regression. The other estimates can be considered generalizations of ridge regression.
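The first three priors listed above can be written as one penalized least squares problem that differs only in the penalty matrix and the target vector. The sketch below is a minimal illustration under that assumption. The function names `generalized_ridge` and `make_prior` are hypothetical, the data are synthetic, and the penalty for *R*:MM+*R* is taken to be the centering matrix, which penalizes differences among the weights but not their common value. The *R*:S2N prior is omitted because its precise form is given in section 2, not here.

```python
import numpy as np

def generalized_ridge(X, y, lam, penalty, target):
    """Minimize ||y - X b||^2 + lam * (b - target)^T penalty (b - target)."""
    A = X.T @ X + lam * penalty
    return np.linalg.solve(A, X.T @ y + lam * (penalty @ target))

def make_prior(name, K):
    """Return (penalty matrix, target vector) for three of the priors."""
    ones = np.ones(K)
    if name == "R:0":        # weights close to zero (standard ridge regression)
        return np.eye(K), np.zeros(K)
    if name == "R:MM":       # weights close to 1/K (multimodel mean)
        return np.eye(K), ones / K
    if name == "R:MM+R":     # weights close to one another; common value left free
        return np.eye(K) - np.outer(ones, ones) / K, np.zeros(K)
    raise ValueError(f"unknown prior: {name}")

# Synthetic stand-in for one grid box (hypothetical data).
rng = np.random.default_rng(4)
N, K = 22, 7
signal = rng.standard_normal(N)
X = signal[:, None] + rng.standard_normal((N, K))
y = signal + 0.5 * rng.standard_normal(N)

# Weights in the strongly constrained regime (very large ridge parameter).
weights = {name: generalized_ridge(X, y, 1e8, *make_prior(name, K))
           for name in ("R:0", "R:MM", "R:MM+R")}
```

With a very large ridge parameter, the *R*:0 weights approach zero, the *R*:MM weights approach 1/*K*, and the *R*:MM+*R* weights approach a single common value that is still fitted to the data. With `lam = 0` all three reduce to ordinary least squares.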

The above regression estimates were tested on the DEMETER hindcasts of 2-m temperature over land. The MMM could predict this variable with statistically significant skill for at least three months. The spatially averaged correlation skill at three months generally exceeds 0.4, and individual 5° × 5° grid boxes can have correlations exceeding 0.8.

Remarkably, none of the proposed regression schemes were able to beat the average skill of the simple MMM. That this occurred despite the fact that one of the schemes recovers the multimodel mean in the limit of large ridge parameter clearly reveals a deficiency in our ability to select an appropriate ridge parameter. We explored two criteria for selecting the ridge parameter: a two-deep cross validation procedure with the inner stage based on 1) ordinary cross validation or 2) generalized cross validation. Cases in which the skill of R:MM fell below that of MMM tended to be associated with ridge parameters that fluctuated strongly from year to year. These fluctuations usually arise when the sum square error is a weak function of the ridge parameter (i.e., the error curve is “flat”). In such cases, sampling errors lead to slight curvature changes, but dramatically different minima. In general, selecting a parameter because it minimizes some data-derived cost function could be problematical if the minimum is not statistically distinguishable from other values. This issue is pertinent to all multimodel regression schemes with tunable parameters. A related problem is that, surprisingly, we were unable to formulate a criterion based on the training sample for predicting the grid boxes that have negative skill. (“Three-deep” cross validation suggests itself here, but this approach does not address the sampling issue discussed above and is computationally prohibitive.)

Evidence was presented to show that the skill of regression schemes could be improved if the underlying selection criterion was more stable with respect to sampling errors. The question arises as to how the stability of selection criteria can be enhanced. One clue is the fact that the optimal ridge parameter varied greatly in space; for instance, the selected ridge parameter can vary from its lowest possible value to its highest possible value between neighboring grid boxes. This small-scale variability seems unphysical in light of the large-scale coherence of most predictable patterns on seasonal time scales. Perhaps, then, information from neighboring grid boxes should be incorporated into the regression scheme. Van den Dool and Rukhovets (1994) do this by assuming that the regression parameters are constant in space, in which case the data from different grid boxes can be pooled, thereby reducing the overfitting. Robertson et al. (2004) apply an ad hoc spatial smoother to their likelihood function before it is maximized to find the regression parameters. It should be recognized, however, that the rejection of regression schemes on the basis that they produce noisy regression parameters reflects an underlying prior belief that the parameters should be “large scale.” This reasoning naturally suggests a Bayesian approach in which the anticipated large-scale structure of the regression parameters is expressed through a suitable prior distribution, with the degree of spatial coherence controlled by a small number of parameters in the prior distribution (which must be estimated from data). It seems plausible that borrowing strength from training data at neighboring grid boxes can stabilize the parameter estimates and reduce overfitting, resulting in improved performance of the more sophisticated regression schemes relative to the simple multimodel mean.
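The benefit of pooling data across grid boxes, in the spirit of the constant-in-space assumption mentioned above, can be illustrated with a small synthetic experiment. The sketch below is not the procedure of any of the cited studies. It assumes that the true weights are identical at every grid box and that the grid boxes are statistically independent, and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
G, N, K = 30, 22, 7                       # grid boxes, years, models (illustrative)
true_w = np.full(K, 1.0 / K)              # same weights at every grid box
X = rng.standard_normal((G, N, K))        # synthetic forecasts
y = X @ true_w + rng.standard_normal((G, N))   # synthetic observations

# Separate least squares fit at each grid box: N samples for K weights.
w_local = np.stack([np.linalg.lstsq(X[g], y[g], rcond=None)[0] for g in range(G)])

# Pooled fit: stack all grid boxes, giving G*N samples for the same K weights.
w_pooled = np.linalg.lstsq(X.reshape(G * N, K), y.reshape(G * N), rcond=None)[0]

# Mean squared error of the estimated weights relative to the true weights.
err_local = np.mean((w_local - true_w) ** 2)
err_pooled = np.mean((w_pooled - true_w) ** 2)
```

In this idealized setting the pooled weights are much closer to the true weights than the grid-box weights, because the number of samples per parameter grows by a factor of `G`. Real hindcasts are spatially correlated and the weights need not be uniform, so the gain in practice would be smaller.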

*Acknowledgments.* Stimulating comments from Michael Tippett and Huug Van den Dool led to significant improvements in the model selection and assessment methodologies employed in this paper. Jennifer Adams provided helpful assistance on the figures. Comments from David Stephenson and the reviewers also led to improvements in the presentation. We gratefully acknowledge the multimodel ensemble hindcast dataset provided freely and conveniently by the DEMETER project. The particular surface temperature dataset used in this study was provided by Kyung Emilia Jin. This research was supported by the National Science Foundation (ATM0332910), the National Aeronautics and Space Administration (NNG04GG46G), and the National Oceanic and Atmospheric Administration (NA04OAR4310034).

## REFERENCES

Barnston, A. G., and T. M. Smith, 1996: Specification and prediction of global surface temperature and precipitation from global SST using CCA. *J. Climate*, **9**, 2660–2697.

Barnston, A. G., S. J. Mason, L. Goddard, D. G. DeWitt, and S. E. Zebiak, 2003: Multimodel ensembling in seasonal climate forecasting at IRI. *Bull. Amer. Meteor. Soc.*, **84**, 1783–1796.

Box, G. E. P., and G. C. Tiao, 1973: *Bayesian Inference in Statistical Analysis*. Addison-Wesley, 588 pp.

Clemen, R. T., 1989: Combining forecasts: A review and annotated bibliography. *Int. J. Forecasting*, **5**, 559–583.

DelSole, T., 2005: Predictability and information theory. Part II: Imperfect forecasts. *J. Atmos. Sci.*, **62**, 3368–3381.

DelSole, T., and J. Shukla, 2006: Specification of wintertime North American surface temperature. *J. Climate*, **19**, 2691–2716.

Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting. Part II: Calibration and combination. *Tellus*, **57A**, 234–252.

Draper, N. R., and R. C. Van Nostrand, 1979: Ridge regression and James Stein estimation: Review and comments. *Technometrics*, **21**, 451–466.

Draper, N. R., and H. Smith, 1998: *Applied Regression Analysis*. 3d ed. John Wiley and Sons, 706 pp.

Efron, B., and R. J. Tibshirani, 1993: *An Introduction to the Bootstrap*. Chapman and Hall, 436 pp.

Goldstein, M., and A. F. M. Smith, 1974: Ridge-type estimators for regression analysis. *J. Roy. Stat. Soc.*, **36B**, 284–291.

Golub, G. H., M. Heath, and G. Wahba, 1979: Generalized cross-validation as a method for choosing a good ridge parameter. *Technometrics*, **21**, 215–223.

Hansen, P. C., 1998: *Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion*. SIAM Monogr. on Mathematical Modeling and Computation, Society for Industrial and Applied Mathematics, 247 pp.

Hoerl, A. E., and R. W. Kennard, 1970: Ridge regression: Applications to non-orthogonal problems. *Technometrics*, **12**, 69–82.

Horn, R. A., and C. R. Johnson, 1985: *Matrix Analysis*. Cambridge University Press, 561 pp.

Jaynes, E. T., 2003: *Probability Theory: The Logic of Science*. Cambridge University Press, 727 pp.

Jones, P. D., and A. Moberg, 2003: Hemispheric and large-scale surface air temperature variations: An extensive revision and an update to 2001. *J. Climate*, **16**, 206–223.

Kharin, V. V., and F. W. Zwiers, 2002: Climate predictions with multimodel ensembles. *J. Climate*, **15**, 793–799.

Krishnamurti, T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. *Science*, **285**, 1548–1550.

Krishnamurti, T. N., C. M. Kishtawal, Z. Zhang, T. E. LaRow, D. R. Bachiochi, C. E. Williford, S. Gadgil, and S. Surendran, 2000: Multimodel ensemble forecasts for weather and seasonal climate. *J. Climate*, **13**, 4196–4216.

Krishnamurti, T. N., and Coauthors, 2001: Real-time multianalysis–multimodel superensemble forecasts of precipitation using TRMM and SSM/I products. *Mon. Wea. Rev.*, **129**, 2861–2883.

Lindley, D. V., and A. F. M. Smith, 1972: Bayes estimates for the linear model (with discussion). *J. Roy. Stat. Soc.*, **34B**, 1–41.

Michaelsen, J., 1987: Cross-validation in statistical climate forecast models. *J. Climate Appl. Meteor.*, **26**, 1589–1600.

Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Peng, P., A. Kumar, H. Van den Dool, and A. G. Barnston, 2002: An analysis of multimodel ensemble predictions for seasonal climate anomalies. *J. Geophys. Res.*, **107**, 4710, doi:10.1029/2002JD002712.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. *Mon. Wea. Rev.*, **130**, 1792–1811.

Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. *Mon. Wea. Rev.*, **132**, 2732–2744.

Smith, G., and F. Campbell, 1980: A critique of some ridge regression methods. *J. Amer. Stat. Assoc.*, **75**, 74–86.

Stephenson, D. B., C. A. S. Coelho, F. J. Doblas-Reyes, and M. Balmaseda, 2005: Forecast assimilation: A unified framework for the combination of multi-model weather and climate predictions. *Tellus*, **57A**, 253–264.

Stone, M., 1974: Cross-validatory choice and assessment of statistical predictions. *J. Roy. Stat. Soc.*, **36A**, 111–147.

Van den Dool, H. M., and L. Rukhovets, 1994: On the weights for an ensemble-averaged 6–10-day forecast. *Wea. Forecasting*, **9**, 457–465.

Yun, W. T., L. Stefanova, and T. N. Krishnamurti, 2003: Improvement of the multimodel superensemble technique for seasonal forecasts. *J. Climate*, **16**, 3834–3840.

# APPENDIX

## Asymptotics of R:MM+R

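This appendix derives the limit (24) of the regression estimate (26) as *λ* → ∞. First, note that the matrix

$$
\mathbf{H} = \mathbf{I} - \frac{1}{K}\mathbf{1}\mathbf{1}^{T} \tag{A1}
$$

is *idempotent*, which means that 𝗛^{2} = 𝗛. By invoking standard properties of idempotent matrices, it can be shown that 𝗛 has rank *K* − 1, and hence is singular. Furthermore, the vector **1** lies in the null space of 𝗛; that is, 𝗛**1** = **0**.

Since 𝗫^{T}𝗫 is positive definite by assumption, and 𝗛 is symmetric, there exists a linear transformation 𝗭 that diagonalizes 𝗫^{T}𝗫 and 𝗛 simultaneously, in the sense that

$$
\mathbf{Z}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{Z} = \mathbf{I}, \qquad \mathbf{Z}^{T}\mathbf{H}\mathbf{Z} = \boldsymbol{\Psi}, \tag{A2}
$$

where **Ψ** is a real diagonal matrix (Horn and Johnson 1985, p. 250). According to the previous paragraph, one of the diagonal elements of **Ψ** must vanish. Since the order of the diagonal elements is not unique, let the first diagonal element be 0. Thus,

$$
\boldsymbol{\Psi} = \mathrm{diag}[0, \psi_2, \ldots, \psi_K], \tag{A3}
$$

where diag[*d*_{1}, *d*_{2}, . . . , *d*_{K}] denotes a square, diagonal matrix with diagonal elements *d*_{1}, *d*_{2}, . . . , *d*_{K}. This ordering implies that the first column vector of 𝗭, denoted ***z***_{1}, must be proportional to the one-vector **1**, since this is the only vector that can produce a zero in the first diagonal element. The normalization (A2) in fact shows that the first column vector must be

$$
\mathbf{z}_1 = \frac{\mathbf{1}}{\sqrt{\mathbf{1}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{1}}}. \tag{A4}
$$

Assuming that 𝗭 is invertible, the identities (A2) can be inverted to express the regression estimate (26) in terms of 𝗭. The result is

$$
\hat{\boldsymbol{\beta}} = \mathbf{Z}\left(\mathbf{I} + \lambda\boldsymbol{\Psi}\right)^{-1}\mathbf{Z}^{T}\mathbf{X}^{T}\mathbf{y}, \tag{A5}
$$

where $\hat{\boldsymbol{\beta}}$ denotes the regression estimate and $\mathbf{y}$ the predictand. Since **Ψ** is diagonal, we have

$$
\left(\mathbf{I} + \lambda\boldsymbol{\Psi}\right)^{-1} = \mathrm{diag}\left[1, \frac{1}{1 + \lambda\psi_2}, \ldots, \frac{1}{1 + \lambda\psi_K}\right]. \tag{A6}
$$

It follows that

$$
\lim_{\lambda\to\infty}\left(\mathbf{I} + \lambda\boldsymbol{\Psi}\right)^{-1} = \mathrm{diag}[1, 0, \ldots, 0]. \tag{A7}
$$

Using (A7) to take the limit of (A5) gives

$$
\lim_{\lambda\to\infty}\hat{\boldsymbol{\beta}} = \mathbf{z}_1\mathbf{z}_1^{T}\mathbf{X}^{T}\mathbf{y}. \tag{A8}
$$

Substituting the definition of ***z***_{1} (A4) into (A8) gives the limit (24).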
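The steps of this derivation can be checked numerically. The sketch below is a minimal verification under stated assumptions: it takes 𝗛 to be the centering matrix 𝗜 − **11**^{T}/*K*, which has the properties listed above (idempotent, rank *K* − 1, 𝗛**1** = **0**), and takes the regression estimate to have the form (𝗫^{T}𝗫 + *λ*𝗛)^{−1}𝗫^{T}𝘆. Both forms, the variable names, and the random test data are assumptions made for the check, since the defining equations lie outside this appendix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 22, 7
X = rng.standard_normal((N, K))    # synthetic predictors
y = rng.standard_normal(N)         # synthetic predictand
ones = np.ones(K)
H = np.eye(K) - np.outer(ones, ones) / K   # assumed form of H

# Simultaneous diagonalization through a Cholesky factor of G = X^T X.
G = X.T @ X
L = np.linalg.cholesky(G)                      # G = L L^T
Linv = np.linalg.inv(L)
psi, Q = np.linalg.eigh(Linv @ H @ Linv.T)     # eigenvalues in ascending order
Z = Linv.T @ Q                                 # Z^T G Z = I and Z^T H Z = diag(psi)

# First column of Z should be +/- 1 / sqrt(1^T X^T X 1).
z1_expected = ones / np.sqrt(ones @ G @ ones)

def beta_direct(lam):
    """Regression estimate from the normal equations (assumed form)."""
    return np.linalg.solve(G + lam * H, X.T @ y)

def beta_diag(lam):
    """The same estimate written in terms of Z and the diagonal matrix."""
    return Z @ ((Z.T @ (X.T @ y)) / (1.0 + lam * psi))

# Limit for large ridge parameter: a single common weight for all models.
beta_limit = ones * (ones @ X.T @ y) / (ones @ G @ ones)
beta_large = beta_diag(1e8)
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order and 𝗛 is positive semidefinite, the vanishing eigenvalue appears first, matching the ordering chosen in the derivation. The limiting weight in `beta_limit` is the ordinary least squares coefficient obtained by regressing the predictand on the sum of the forecasts.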

Summary of the different multimodel ridge regressions. The notation *N*(***μ***, **Σ**) denotes a normal distribution with mean ***μ*** and covariance matrix **Σ**. See section 2 for further details.

Identification letters, modeling institution, and country of origin of the seven coupled ocean–atmosphere models whose hindcasts in the DEMETER project were used in this study.

The designation and boundaries of regions used to compute area-averaged correlation skill. Only land points are included in the defined regions.