  • Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, 963–977.
  • DelSole, T., 2007: A Bayesian framework for multimodel regression. J. Climate, 20, 2810–2826.
  • DelSole, T., and J. Shukla, 2006: Specification of wintertime North America surface temperature. J. Climate, 19, 2691–2716.
  • Hastie, T., R. Tibshirani, and J. H. Friedman, 2003: The Elements of Statistical Learning. Corrected ed. Springer, 552 pp.
  • Kistler, R., and Coauthors, 2001: The NCEP–NCAR 50-Year Reanalysis: Monthly means CD-ROM and documentation. Bull. Amer. Meteor. Soc., 82, 247–267.
  • Peña, M., and H. van den Dool, 2008: Consolidation of multimodel forecasts by ridge regression: Application to Pacific sea surface temperature. J. Climate, 21, 6521–6538.
  • Penland, C., and P. D. Sardeshmukh, 1995: The optimal growth of tropical sea surface temperature anomalies. J. Climate, 8, 1999–2024.
  • Phelps, M. W., A. Kumar, and J. J. O'Brien, 2004: Potential predictability in the NCEP CPC dynamical seasonal forecast system. J. Climate, 17, 3775–3785.
  • Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes. Cambridge University Press, 693 pp.
  • Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. Mon. Wea. Rev., 132, 2732–2744.
  • van den Dool, H., and L. Rukhovets, 1994: On the weights for an ensemble-averaged 6–10-day forecast. Wea. Forecasting, 9, 457–465.
  • Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual prediction—Skill and progress beyond DEMETER in forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.
  • Yun, W. T., L. Stefanova, and T. N. Krishnamurti, 2003: Improvement of the multimodel superensemble technique for seasonal forecasts. J. Climate, 16, 3834–3840.
  • Zwiers, F. W., 1987: A potential predictability study conducted with an atmospheric general circulation model. Mon. Wea. Rev., 115, 2957–2974.

Scale-Selective Ridge Regression for Multimodel Forecasting

  • 1 George Mason University, Fairfax, Virginia, and Center for Ocean-Land-Atmosphere Studies, Calverton, Maryland
  • 2 Center for Ocean-Land-Atmosphere Studies, Calverton, Maryland
  • 3 International Research Institute for Climate and Society, Palisades, New York, and Center of Excellence for Climate Change Research, Department of Meteorology, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract

This paper proposes a new approach to linearly combining multimodel forecasts, called scale-selective ridge regression, which ensures that the weighting coefficients satisfy certain smoothness constraints. The smoothness constraint reflects the “prior assumption” that seasonally predictable patterns tend to be large scale. In the absence of a smoothness constraint, regression methods typically produce noisy weights and hence noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. The proposed method is equivalent to minimizing a cost function comprising the familiar mean square error plus a “penalty function” that penalizes weights with large spatial gradients. The method reduces to pointwise ridge regression for a suitable choice of constraint. The method is tested using the Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) hindcast dataset during 1960–2005. The cross-validated skill of the proposed forecast method is shown to be larger than the skill of either ordinary least squares or pointwise ridge regression, although the significance of this difference is difficult to test owing to the small sample size. The model weights derived from the method are much smoother than those obtained from ordinary least squares or pointwise ridge regression. Interestingly, regressions in which the weights are completely independent of space give comparable overall skill. The scale-selective ridge is numerically more intensive than pointwise methods since the solution requires solving equations that couple all grid points together.

Current affiliation: NOAA/Geophysical Fluid Dynamics Laboratory, Princeton, New Jersey, and University Corporation for Atmospheric Research, Boulder, Colorado.

Corresponding author address: Timothy DelSole, Center for Ocean-Land-Atmosphere Studies, 4041 Powder Mill Rd., Suite 302, Calverton, MD 20705. E-mail: delsole@cola.iges.org


1. Introduction

Several operational forecast centers issue predictions of weather and climate. The availability of multiple forecasts from different institutions raises the question of whether the forecasts can be combined to increase skill and reliability. Although several multimodel prediction systems have been proposed in the past, the relatively short datasets available for calibration lead to serious problems with overfitting. Consequently, a variety of approaches have been proposed to mitigate overfitting, including truncated singular value decomposition (SVD) analysis, ridge regression, and constrained least squares (Yun et al. 2003; van den Dool and Rukhovets 1994; DelSole 2007). These procedures often are applied on a point-by-point basis.

Recently, it has been recognized that seemingly different approaches to multimodel forecasting are actually special cases of a single Bayesian methodology, each distinguished by different prior assumptions on the model weights (DelSole 2007). This recognition has two important implications. First, it clarifies how each multimodel method fits within a unified framework. Second, it provides a mathematically rigorous and consistent methodology for incorporating prior information into the multimodel strategy. For instance, ridge regression can be interpreted as Bayesian regression under the prior assumption that the sum square weights are bounded; truncated SVD analysis can be interpreted as Bayesian regression under the prior assumption that the weights lie in the same space as the leading principal components of the predictors. Other reasonable priors, such as that the weights should be close to the multimodel mean, were suggested and implemented by DelSole (2007).

In this paper, we develop a multimodel regression based on a new prior assumption, namely, that the forecast field should be a smooth function of space. This constraint is motivated by the fact that seasonally predictable patterns tend to be large scale. However, even if the forecast fields themselves are smooth, ordinary least squares regression applied to such forecasts typically generates noisy weights and therefore noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. There is no unique way to account for this prior information. For instance, one could apply spatial smoothing to certain steps in a multimodel procedure (Robertson et al. 2004). However, intuitively, it would seem that weights need not be smoothed strongly where the regression fits the data well. Alternatively, one could pool neighboring points to estimate the weight at a single point, as proposed by Peña and van den Dool (2008). This approach is a local linear regression method with a constant kernel (Hastie et al. 2003, chapter 6). One could consider more general kernels that die off smoothly with distance from the target point, but the kernel still would be constant, independent of goodness of fit. The purpose of this paper is to develop a new multimodel prediction system in which the weights are constrained to vary smoothly in space but with the degree of smoothness being balanced against goodness of fit.

2. Statement of the problem

The basic goal is to predict an observed field given a set of forecast fields. Let the predictand fields be denoted by the N-dimensional vectors

y_s = [y_1s, y_2s, …, y_Ns]^T,  (1)

where y_ns denotes the predictand for the nth sample (usually identified with the year of a seasonal forecast) at the sth spatial grid cell. Similarly, let the forecast fields at the sth spatial grid cell be denoted by the N × M dimensional matrices X_s with elements

[X_s]_nm = x_nms,  (2)

where x_nms denotes the forecast for the nth sample of the mth model at the sth spatial grid cell (forecasts and observations are on the same grid). We seek a prediction of y_ns based on a linear combination of forecast fields:

ŷ_ns = Σ_m w_ms x_nms,  (3)

where w_ms are unknown weights. As is implicit in this equation, we consider only pointwise combinations in which the prediction at a point depends only on the forecasts at that point; no correction of forecast patterns is considered, which would involve considerably more parameters to estimate from data. We assume that the predictands and forecasts are centered. The problem is to estimate the weights given a realization of predictands y_s and corresponding forecasts X_s. The ordinary least squares (OLS) method determines weights that minimize the sum square residual

SSR = Σ_s Σ_n (y_ns − Σ_m w_ms x_nms)².  (4)

The minimizing value of w_ms is found by differentiating the sum square residual with respect to w_ms and setting the result to zero. This standard procedure leads to the set of equations

(X_s^T X_s) w_s = X_s^T y_s,  (5)

where superscript T denotes the transpose, s = 1, 2, …, S, and we have defined the M-dimensional vector

w_s = [w_1s, w_2s, …, w_Ms]^T.  (6)

Note that the "normal equation" (5) for a particular spatial grid cell s is decoupled from the normal equations for other spatial grid cells. This decoupling implies that the weights can be determined by minimizing the residual for each value of s separately from the other values of s, that is, by fitting a linear combination of forecasts to observations at each grid cell individually and independently. It follows that the desired weights are

w_s = (X_s^T X_s)^{−1} X_s^T y_s.  (7)

The above solution will be called the ordinary least squares solution.
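As a concrete illustration, the pointwise OLS solution (7) amounts to solving a small M × M system independently at each grid cell. The following is a minimal sketch using NumPy with synthetic data (the function name and data are ours, not from the paper):

```python
import numpy as np

def ols_weights(X, y):
    """OLS weights for one grid cell, solving (X^T X) w = X^T y as in (5)-(7).

    X : (N, M) centered ensemble-mean forecasts (N samples, M models).
    y : (N,) centered observations at the same grid cell.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noise-free synthetic check: the true weights are recovered exactly.
rng = np.random.default_rng(0)
N, M = 46, 5                       # 46 years, 5 models, as in ENSEMBLES
X = rng.standard_normal((N, M))
w_true = np.array([0.30, 0.10, 0.20, 0.25, 0.15])
y = X @ w_true
w = ols_weights(X, y)
```

In practice this fit is repeated independently at every grid cell, which is precisely what allows OLS to produce spatially noisy weights.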

The ordinary least squares solution often yields weights that vary considerably in space and perform relatively poorly on independent data, suggesting problems with overfitting (this will be illustrated in the results section). To mitigate overfitting, one might use as predictors only a small number of the leading principal components of the forecasts. This approach is formally equivalent to that proposed by Yun et al. (2003) and discussed in Press et al. (1992) based on the SVD. Another approach is to apply ridge regression (van den Dool and Rukhovets 1994). Both approaches involve empirical parameters: the SVD approach requires specifying the number of principal components, while ridge regression requires specifying the value of the ridge parameter. These approaches usually are applied at each grid point separately, with no explicit constraint on the spatial structure of the weights.

An important insight is that the above two methods are special cases of a more general Bayesian methodology (DelSole 2007). Specifically, both approaches arise when the errors are Gaussian and the weights are constrained: ridge regression arises when the weights are constrained to have a prespecified L2 norm, while the principal component approach arises when certain combinations of weights are constrained to vanish.

3. Proposed strategy

Although the constraints implicit in the above methods seem reasonable, other prior information may be worth taking into account. Specifically, numerous studies show that seasonally predictable structures tend to be large scale (Zwiers 1987; Penland and Sardeshmukh 1995; Phelps et al. 2004; DelSole and Shukla 2006). This fact suggests that the weights used to combine different model forecasts also should be large scale. Indeed, most experienced forecasters probably would reject a combination whose weights vary strongly in space. Also, numerical theory independently indicates that dynamical models are least reliable at the shortest length scales. Accordingly, rapidly varying weights designed to extract information from small-scale structure in the forecasts would seem to be of dubious value. For these and other reasons, it seems reasonable to assume that the weights should be large scale so that they do not introduce unpredictable small-scale noise. We would like to include this prior information into the regression.

A Bayesian framework offers the most general framework for incorporating prior information into regression. Unfortunately, this framework involves a cumbersome manipulation of probability distributions. The end result, however, is equivalent to solving a constrained least squares problem, which itself is intuitive and straightforward. Therefore, we describe our multimodel regression strategy in terms of a constrained least squares problem. The proposed strategy is a special case of Tikhonov regularization. For a suitable choice of constraint, the proposed solution reduces to standard ridge regression. However, we propose a constraint that suppresses small-scale variability in the weights more than large-scale variability. Accordingly we call the proposed approach scale-selective ridge regression.

a. Incorporating prior information from physical insight

The OLS problem is to find the weighting coefficients that minimize (4). We now generalize the problem by including a "penalty function" as follows:

SSRR = Σ_s Σ_n (y_ns − Σ_m w_ms x_nms)² + λ Σ_m w_m^T Ω w_m,  (8)

where w_m = [w_m1, w_m2, …, w_mS]^T collects the weights of the mth model over all grid cells, Ω is a symmetric S × S matrix describing the penalty applied to the spatial structure of the weights, and λ is a penalty parameter that controls the overall strength of the constraint. The second term in (8) is a new penalty function that constrains the spatial structure of the weights. The proposed constraint does not depend on model, as reflected by the absence of an m index on Ω. The regression design can be generalized in several directions, including allowing model dependence in the above parameters and allowing weights to be applied to a patch of grid points centered about the point being predicted. However, the above design is already considerably complex and serves as a reasonable starting point for generalized multimodel regression.

The weights that minimize (8) can be found by setting to zero the derivative of SSRR with respect to w_ms, which yields the following system of equations:

Σ_m' [X_s^T X_s]_mm' w_m's + λ Σ_s' Ω_ss' w_ms' = [X_s^T y_s]_m,  (9)

where m = 1, 2, …, M and s = 1, 2, …, S. This system of equations constitutes MS independent linear equations for the MS unknown quantities w_ms, which can be solved by direct matrix methods if MS is not prohibitively large. In particular, (9) can be written in the form A w = b, where w stacks the w_ms into an MS-dimensional vector, with the identifications

A_(ms),(m's') = δ_ss' [X_s^T X_s]_mm' + λ δ_mm' Ω_ss',  (10)

b_(ms) = [X_s^T y_s]_m,  (11)

where δ denotes the Kronecker delta. In this notation, the optimal weights are

w = A^{−1} b,  (12)

provided A is nonsingular. As a simple example, in the case Ω = I, the system decouples in space and the solution becomes

w_s = (X_s^T X_s + λI)^{−1} X_s^T y_s.  (13)

This solution, which is equivalent to that proposed by van den Dool and Rukhovets (1994), will be called pointwise ridge regression.
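Pointwise ridge regression (13) differs from OLS only by the λI term, which shrinks the weights toward zero as λ grows. A minimal sketch with synthetic data (names and data are ours):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Pointwise ridge weights (13): (X^T X + lam*I)^{-1} X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Synthetic example: noisy observations built from equal true weights.
rng = np.random.default_rng(1)
X = rng.standard_normal((46, 5))
y = X @ np.full(5, 0.2) + 0.5 * rng.standard_normal(46)

w_small = ridge_weights(X, y, 0.1)
w_large = ridge_weights(X, y, 100.0)
# The weight norm shrinks monotonically as lam increases.
shrinks = np.linalg.norm(w_large) < np.linalg.norm(w_small)
```

Setting lam = 0 recovers the OLS solution (7), while very large lam drives all weights toward zero.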

An attractive property of linear regression is that the associated predictions are invariant to linear transformations of the predictors. This property does not hold for ridge regression. Consequently, many texts on ridge regression recommend standardizing the predictors prior to applying ridge regression. We have compared both approaches and have found that regressions based on standardized forecasts have higher skill than those based on unstandardized forecasts. Accordingly, we present results only for standardized forecasts.

The above solution requires inverting an MS × MS dimensional matrix. Thus, even for moderate resolution (e.g., a 2.5° × 2.5° grid implies S ≈ 10 000), the matrix can be prohibitively large to invert. This numerical limitation should not be construed as fundamental. For instance, conjugate gradient methods probably would be able to solve these equations more efficiently without ever forming the matrix A explicitly. Also, the matrix A is of symmetric block Toeplitz form, suggesting that special inversion methods could be brought to bear.
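For a small domain, the coupled system (9)–(12) can be assembled and solved directly. The sketch below (variable names are ours) uses the ordering index m·S + s for the stacked unknowns; with Ω = I the result must reproduce pointwise ridge regression (13) cell by cell, which provides a useful consistency check:

```python
import numpy as np

def scale_selective_ridge(X, Y, Omega, lam):
    """Solve the coupled normal equations (9) for all weights at once.

    X     : (N, M, S) centered forecasts (sample, model, grid cell).
    Y     : (N, S) centered observations.
    Omega : (S, S) symmetric spatial penalty matrix.
    Returns W : (M, S); unknowns are ordered with flat index m*S + s.
    """
    N, M, S = X.shape
    A = lam * np.kron(np.eye(M), Omega)      # lam * delta_mm' * Omega_ss' term
    b = np.zeros(M * S)
    for s in range(S):
        Xs = X[:, :, s]                      # (N, M) forecasts at cell s
        idx = np.arange(M) * S + s           # positions of w_1s, ..., w_Ms
        A[np.ix_(idx, idx)] += Xs.T @ Xs     # delta_ss' * (X_s^T X_s)_mm' term
        b[idx] = Xs.T @ Y[:, s]
    return np.linalg.solve(A, b).reshape(M, S)

# Consistency check: Omega = I must reproduce pointwise ridge (13).
rng = np.random.default_rng(2)
N, M, S = 46, 5, 12
X = rng.standard_normal((N, M, S))
Y = rng.standard_normal((N, S))
lam = 10.0
W = scale_selective_ridge(X, Y, np.eye(S), lam)
W_point = np.stack([
    np.linalg.solve(X[:, :, s].T @ X[:, :, s] + lam * np.eye(M),
                    X[:, :, s].T @ Y[:, s])
    for s in range(S)], axis=1)
```

The dense assembly here scales as (MS)², which is exactly the limitation noted above; an iterative solver would avoid forming A.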

b. Specifying the constraint

The matrix Ω specifies the constraint for ensuring that the weights vary smoothly in space. Such a constraint requires preferentially penalizing small-scale structure while weakly penalizing large-scale structure. One approach is to define the penalty function based on the following measure of inverse spatial scale, which holds for a smooth function w with homogeneous boundary conditions:

C = −(∫ w ∇²w dA) / (∫ w² dA) = (∫ |∇w|² dA) / (∫ w² dA),  (14)

where the integral is taken over the area of the domain. If w is a spherical harmonic then C becomes the corresponding total wavenumber. These considerations naturally suggest identifying Ω with the finite difference representation of the operator −∇².

The above approach becomes problematic if the domain is not continuous, especially if the domain consists of isolated "islands" of grid cells disconnected from other grid cells, as commonly happens with land-only analyses, for then the finite difference derivative cannot be evaluated. We propose an alternative approach that avoids this problem. Specifically, (14) shows that the quantity C can be represented in two equivalent ways. The numerator of the last expression in (14) is, in finite difference form, the sum square difference between nearest neighbors of w divided by distance. We propose generalizing this term by considering all possible combinations of points, in which case the scale measure becomes

C_p = Σ_s Σ_{s'≠s} (w_s − w_s')² / d_ss'^p,  (15)

where d_ss' is the great circle distance between the points s and s′ and p is an exponent that controls the degree to which small-scale structure is penalized relative to large-scale structure. It is evident that C_p is a nonnegative quantity and vanishes if and only if the weights are spatially constant. Furthermore, a difference in weights between two nearby points is penalized more strongly than the same difference between two faraway points, owing to the distance factor d_ss'^p. Of course any power of distance could be used to achieve the same effect, so the penalty function proposed here is not claimed to be unique. Finally, the existence of islands of grid cells presents no special problems in the evaluation of the above scale measure.
To include the scale measure (15) in the cost function (8), it must be written as a quadratic form in the weights. The matrix Ω̃ that yields (15) as a quadratic form, C_p = w^T Ω̃ w, is

Ω̃_ss' = −2 / d_ss'^p for s ≠ s′, and Ω̃_ss = 2 Σ_{s''≠s} 1 / d_ss''^p.  (16)

This matrix has the property that the sum of any row or column vanishes, which implies that this penalty function does not constrain the spatial mean of the weights. For completeness, we considered the more general cost function

SSRR2 = SSRR + λ₂ Σ_m Σ_s (w_ms − 1/M)²,  (17)

where SSRR is the original cost function (8) and the last term in (17) is an additional penalty term on the deviation of the weights from 1/M. In essence, the penalty parameter λ controls the spatial gradients of the weights while λ₂ controls the constant value about which the weights vary. However, we found in all cases λ₂ = 0 gave the best cross-validated skill, so we do not discuss results based on the cost function (17) but rather consider only (8).

For comparison with pointwise ridge regression, it turns out to be convenient to normalize Ω̃ by its diagonal value along the equator, denoted Ω̃_EQ. This normalization effectively rescales λ such that it has the same meaning along the equator for both the pointwise and scale-selective ridge, allowing the skill to be shown as a function of λ on the same graph for both pointwise and scale-selective ridge regressions. Accordingly, the final matrix Ω is defined as

Ω = Ω̃ / Ω̃_EQ.  (18)
To demonstrate that the function Cp penalizes small-scale variability, we evaluate the ratio w^T Ω w / w^T w for p = 2 for weights w equal to a spherical harmonic. Spherical harmonics are the eigenfunctions of the Laplacian operator, and the corresponding total wavenumber, given by the eigenvalues, provides a natural measure of "spatial scale." The value of Cp as a function of total wavenumber for the spherical harmonics appropriate to the domain used in this paper is shown in Fig. 1. Recall that for each total wavenumber N there are 2N + 1 spherical harmonics. The figure shows that Cp tends to increase with total wavenumber, demonstrating that the penalty function is larger for fields with smaller-scale variability.
Fig. 1.

The penalty function Cp, for p = 2, evaluated for the gravest spherical harmonics as a function of total wavenumber.

Citation: Journal of Climate 26, 20; 10.1175/JCLI-D-13-00030.1

c. Selecting the penalty parameter

An outstanding question in the above approach is how to choose the penalty parameter λ. A standard approach is to select penalty parameters to minimize the cross-validated error of the regression model. In leave-one-out cross validation, one sample of the dataset is withheld and the remaining samples are used to fit the model. Then, the resulting model is used to predict the withheld sample. This procedure is repeated using a different withheld sample in turn until all samples have been withheld exactly once.
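The leave-one-out procedure just described can be sketched generically as follows (the `fit`/`predict` callables are our own illustrative interface, not from the paper):

```python
import numpy as np

def loocv_predictions(X, y, fit, predict):
    """Leave-one-out cross-validated predictions at one grid cell.

    X : (N, M) predictors; y : (N,) predictand.
    fit(Xtr, ytr) -> fitted model; predict(model, x) -> scalar forecast.
    """
    N = X.shape[0]
    yhat = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n            # withhold sample n
        model = fit(X[keep], y[keep])       # refit on the remaining samples
        yhat[n] = predict(model, X[n])      # predict the withheld sample
    return yhat

# Example with a lightly regularized ridge fit on noise-free synthetic data.
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 3))
y = X @ np.array([0.5, -0.2, 0.3])
fit = lambda Xt, yt: np.linalg.solve(Xt.T @ Xt + 1e-8 * np.eye(3), Xt.T @ yt)
predict = lambda w, x: x @ w
yhat = loocv_predictions(X, y, fit, predict)
```

Each of the N forecasts is thus made by a model that never saw the verifying sample, which is what makes the resulting skill estimate (approximately) out of sample.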

Selecting the penalty parameter to maximize cross-validated skill leads to artificially inflated estimates of out-of-sample prediction skill. The reason for this selection bias is that random variations in cross-validated skill can be mistaken for real differences in skill. A standard way to avoid this bias is to perform nested cross validation (DelSole 2007). However, this procedure is computationally intensive and obscures interpretation since no single model and penalty parameter is being tested. On the other hand, our primary goal is to compare different multimodel strategies. Accordingly, we explore a range of penalty parameters. Furthermore, when specific results are illustrated, we select the penalty parameter that maximizes the cross-validated skill for each strategy separately. Even though the cross-validated skill may be biased and there may be considerable uncertainty in the choice of penalty parameter, the comparison at least shows how well each strategy performs in a best-case cross-validation scenario.

The question arises as to how to measure the prediction error of the cross-validated forecasts. Unfortunately, there is no unique measure that characterizes prediction error in both space and time. We specifically avoid using measures based on correlation for two reasons: correlation gives a misleading impression of skill when trends are present in the data, and correlations are not additive even for independent events. Instead, we use the squared error skill score (SESS):

SESS_s = 1 − Σ_n (y_ns − ŷ_ns)² / Σ_n (y_ns − ȳ_s)²,  (19)

where ŷ_ns is the (cross validated) prediction of y_ns, and ȳ_s is the time mean of y_ns. Because observations are standardized to unit variance, SESS_s can be written equivalently as

SESS_s = 1 − (1/N) Σ_n (y_ns − ŷ_ns)².  (20)

To summarize the skill over the globe, we use the area-weighted mean SESS:

⟨SESS⟩ = Σ_s a_s SESS_s,  (21)

where a_s is the fractional area of the sth grid cell.
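The skill scores (19) and (21) can be computed as follows (a sketch; `area` is assumed to hold fractional grid-cell areas summing to one):

```python
import numpy as np

def sess(y, yhat):
    """Squared error skill score (19) at one grid cell."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def mean_sess(Y, Yhat, area):
    """Area-weighted mean SESS (21) over all S grid cells.

    Y, Yhat : (N, S) observations and predictions; area : (S,) fractional
    areas summing to 1.
    """
    scores = np.array([sess(Y[:, s], Yhat[:, s]) for s in range(Y.shape[1])])
    return float(np.sum(area * scores))

# Sanity checks: a perfect forecast scores 1; climatology scores 0.
rng = np.random.default_rng(5)
Y = rng.standard_normal((46, 4))
area = np.full(4, 0.25)
perfect = mean_sess(Y, Y, area)
climo = mean_sess(Y, np.tile(Y.mean(axis=0), (46, 1)), area)
```

Negative values of SESS indicate a forecast worse than the climatological mean, which is the degeneracy discussed in section 5.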

4. Data

The dataset used to test the proposed weighting scheme is the seasonal hindcast dataset from the Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) project. This dataset, reviewed by Weisheimer et al. (2009), consists of 7-month hindcasts by five state-of-the-art coupled atmosphere–ocean general circulation models (AOGCMs) from the Met Office, Météo France, European Centre for Medium-Range Weather Forecasts, Leibniz Institute of Marine Sciences at Kiel University, and Euro-Mediterranean Centre for Climate Change in Bologna. All models include major radiative forcings and were initialized using realistic estimates from observations. Hindcasts initialized on the first of February, May, August, and November of each year in the 46-yr period 1960–2005 were examined, but we show results only for November initial conditions. Each model produced a nine-member ensemble hindcast. These were averaged to construct an ensemble-mean forecast for each model. Considering only ensemble-mean forecasts is justified when only linear combinations of forecasts are considered (DelSole 2007). Further details of these data can be found in Weisheimer et al. (2009).

The variable considered in this study is 2-m surface temperature, which was interpolated onto a common 10° × 10° grid. A relatively coarse grid is used to facilitate numerical solution of the scale-selective ridge regression. We examine the 3-month mean hindcasts for November–January (NDJ), initialized in November, primarily because the El Niño–Southern Oscillation (ENSO) signal is expected to be largest during boreal winter. The hindcasts and verifications were centered by subtracting the respective grand means.

The observation-based surface temperature dataset used for verifying the 2-m temperature hindcasts is the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP–NCAR) reanalysis (Kistler et al. 2001).

5. Results

First, we illustrate some aspects of ordinary least squares regression. As a representative example, we consider hindcasts of NDJ 2-m temperature by five AOGCMs initialized in November. The weights derived from OLS for an arbitrarily selected model are shown in Fig. 2. The weights have numerous negative values and vary significantly on small scales. For instance, some grid points have large positive values juxtaposed to large negative values. To measure the skill of the OLS regression, we perform leave-one-out cross validation. The skill of the OLS regression, calculated from (20), is shown in the top left panel of Fig. 3. The figure shows that the multimodel forecast has negative skill in many regions in the midlatitudes. For comparison, we consider the scaled multimodel mean regression

ŷ_ns = w_s^SMMM (1/M) Σ_m x_nms,  (22)

where w_s^SMMM is a single weight determined separately at each point by least squares. The cross-validated skill of this regression model, shown in the top right panel of Fig. 3, is comparable to or greater than that of the OLS regression almost everywhere and has fewer negative skill values. Thus, the scaled multimodel mean regression, which involves a single tunable parameter, appears to perform better than OLS regression. Moreover, the corresponding weights, shown in the top right panel of Fig. 2, have much less spatial variability.
Fig. 2.

Weights for the Action de Recherche Petite Echelle Grande Echelle (ARPEGE) model derived from ordinary least squares, scaled multimodel mean, pointwise ridge regression, and scale-selective ridge regression, for predicting NDJ 2-m temperature using the ENSEMBLES hindcasts during 1960–2005. The hindcasts were initialized in November. The power parameter for the scale-selective ridge is p = 2.


Fig. 3.

Squared error skill score of cross-validated forecasts from the ordinary least squares regression, scaled multimodel mean model, pointwise ridge regression, and scale-selective regression (p = 2) for predicting NDJ 2-m temperature. The multimodel ensemble consists of five coupled AOGCMs initialized in November during 1960–2005.


The above results illustrate classic symptoms of overfitting—that is, fitting variability that is not reproducible in independent samples (i.e., fitting the noise). Also, collinearity is a potential problem because the forecasts are correlated in ENSO-dominated regions. When overfitting and collinearity occur, the weights become unstable and the model performs worse in independent data than models with fewer parameters. The results also show degeneracy of the type discussed by Barnston and van den Dool (1993), whereby areas with inherently little skill produce negative skill estimates in a leave-one-out cross-validation scheme.

To deal with overfitting and collinearity, we consider several models, including pointwise ridge regression with weights derived from (13) and the scale-selective ridge with weights derived from (12) using the penalty function (18). The cross-validated skill of each model as a function of the ridge parameter λ is shown in Fig. 4. For the scale-selective ridge, we show results only for the power p = 2 [see (15)], as other integer powers had smaller skill. The figure shows that λ = 50 and 10 maximize the skill for scale-selective and pointwise ridge regression, respectively. Recall that λ = 0 corresponds to OLS regression, and λ = ∞ corresponds to the case of a single weight for all models and grid points. Also shown is the skill of a multimodel regression with weights dependent on space but not model (circle) and dependent on model but not space (diagonal cross). The latter two regressions have skill comparable to the best ridge regressions.

Fig. 4.

The skill of multimodel hindcasts of NDJ 2-m temperature using scale-selective ridge regression (solid) and pointwise ridge regression (dashed) as a function of the ridge parameter λ. Also shown is the skill of regressions in which the weights depend on space but not model (circle) and depend on model but not space (diagonal cross). Skill is measured by the SESS. Ordinary least squares regression corresponds to λ = 0, and regression using a single weight for all models and space points corresponds to λ = ∞. The power parameter for the scale-selective ridge is p = 2.

In all cases, the scale-selective ridge has larger skill than the pointwise ridge (as reflected by the fact that the solid line lies above the dashed line for all λ in Fig. 4). Whether this difference in skill is statistically significant is difficult to test. We also see that the scaled multimodel mean (far left dot) typically performs as well as, if not better than, the pointwise ridge (as reflected by the fact that the dot on the left side is comparable to, or above, the dashed line). Note also that the skill of either ridge regression is greater than the skill of OLS (as reflected by the fact that the skill at λ = 0 is less than the maximum skill in Fig. 4). Thus, both versions of ridge regression appear to address collinearity and overfitting issues of OLS. The spatial dependence of skill for the two regressions is illustrated in Fig. 3. The most obvious difference is the reduction of negative skill values for the scale-selective ridge compared to other regressions, especially OLS.

The weights for an arbitrarily selected model derived from the scale-selective ridge and the pointwise ridge are shown in Fig. 2. As anticipated, the weights for the scale-selective ridge are smoother than those for OLS (shown in Fig. 2). In fact, the weights for the scale-selective ridge are nearly constant. The extreme case of a single weight for all models at all space points, corresponding to λ = ∞, has skill comparable to the best ridge regressions.

The above calculations were repeated for all available initial months, but the same conclusions hold. Specifically, in each case, the scale-selective ridge had larger cross-validated skill than the pointwise ridge, and regressions with weights that depend on space but not model, or vice versa, had skill comparable to the best ridge regressions. February initial conditions had the largest skill.

6. Summary and discussion

This paper proposed a new approach to linearly combining multimodel forecasts, called the scale-selective ridge, which ensures that the weighting coefficients satisfy a smoothness constraint. The constraint is motivated by the fact that seasonally predictable patterns tend to be large scale. In the absence of a smoothness constraint, regression methods typically produce noisy weights and hence noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. The weighting coefficients are estimated by a constrained least squares method, which is equivalent to minimizing a cost function comprising the familiar mean square error plus a penalty function that penalizes spatial gradients in the weights. The procedure requires specifying a parameter that controls the strength of the penalty function; this penalty parameter is chosen by cross-validation. For a suitable choice of constraint, the regression model reduces to pointwise ridge regression.
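As a concrete illustration of this cost function, the sketch below solves a one-model, one-dimensional analogue in closed form: minimize the sum over years of ||y − w ∘ f||² plus λ||Dw||², where D is a forward-difference (discrete gradient) operator on the weight field. The grid size, data, and function names are synthetic stand-ins, not the paper's two-dimensional multimodel implementation:

```python
import numpy as np

# Synthetic one-model analogue on a 1-D grid of s points.  The weight w
# varies over the grid; the cost is  sum_t ||y_t - w * f_t||^2 + lam ||D w||^2.
rng = np.random.default_rng(2)
s, years = 20, 46
F = rng.standard_normal((years, s))                  # model hindcasts
y = 0.7 * F + 0.5 * rng.standard_normal((years, s))  # verifying observations

D = np.diff(np.eye(s), axis=0)   # (s-1, s) discrete gradient operator

def smooth_weights(F, y, lam):
    # The least squares term is diagonal in w because each grid point's
    # weight multiplies only that point's forecast; the penalty lam * D'D
    # couples neighbouring grid points, which is what smooths the field.
    A = np.diag((F ** 2).sum(axis=0)) + lam * D.T @ D
    return np.linalg.solve(A, (F * y).sum(axis=0))

w_ols = smooth_weights(F, y, 0.0)      # lam = 0: independent pointwise OLS
w_smooth = smooth_weights(F, y, 50.0)  # spatial gradients penalized

# The smoothed weights vary far less across the grid than the OLS weights.
print(np.ptp(w_ols), np.ptp(w_smooth))
```

As λ grows, the penalty dominates and the weight field tends toward a constant, which is the single-weight limit discussed in section 5.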

Scale-selective ridge regression was tested with the ENSEMBLES hindcast dataset, using hindcasts initialized in November and validated against the November–January average over 1960–2005. The resulting multimodel hindcasts were compared to those produced by pointwise ridge regression, as proposed by van den Dool and Rukhovets (1994). In the case of 2-m temperature, the weights derived from the scale-selective ridge are almost uniform in space. In fact, regressions in which the weights depend on model but not space, or depend on space but not model, had nearly the same skill as the scale-selective ridge. Nevertheless, scale-selective ridge regression yields greater aggregate skill than the pointwise ridge, although the significance of this difference is difficult to test.

The scale-selective ridge is computationally intensive, since the weight at any grid point depends on the data at all other grid points. Conjugate gradient minimization, or matrix inversion methods that exploit the symmetric Toeplitz structure of the matrices, may offer more efficient solutions. Furthermore, the advantages of the scale-selective ridge are likely to become more dramatic for very high resolution data, since pointwise methods are likely to be led astray by the greater observational uncertainties and the poorer skill of numerically predicted fields at small scales.
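The conjugate gradient idea can be sketched as follows. In the one-dimensional analogue, the normal-equation matrix is a diagonal data term plus λDᵀD, so each matrix-vector product costs O(s) and the matrix never needs to be formed or inverted. The grid, data, and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
s, lam = 500, 50.0
g = rng.uniform(1.0, 2.0, s)     # diagonal of the (pointwise) data term
b = rng.standard_normal(s)

def matvec(w):
    """Apply (G + lam * D^T D) to w in O(s) operations, where D is the
    forward-difference operator; the matrix is never formed."""
    dw = np.diff(w)              # D w
    return g * w + lam * np.concatenate(([-dw[0]], dw[:-1] - dw[1:], [dw[-1]]))

def conjugate_gradient(matvec, b, tol=1e-10, maxiter=2000):
    """Standard conjugate gradient for a symmetric positive definite
    system, using only matrix-vector products."""
    w = np.zeros_like(b)
    r = b - matvec(w)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        w = w + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

w = conjugate_gradient(matvec, b)
print(np.linalg.norm(matvec(w) - b))   # residual near zero
```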

Acknowledgments

This research was supported primarily by the National Oceanic and Atmospheric Administration, under the Climate Test Bed program (NA10OAR4310264). Additional support was provided by the National Science Foundation (ATM0332910, ATM0830062, and ATM0830068), National Aeronautics and Space Administration (NNG04GG46G and NNX09AN50G), the National Oceanic and Atmospheric Administration (NA04OAR4310034, NA09OAR4310058, NA05OAR4311004, NA10OAR4310210, and NA10OAR4310249), and Office of Naval Research (N00014-12-1-0911). The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.

REFERENCES

  • Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, 963–977.

  • DelSole, T., 2007: A Bayesian framework for multimodel regression. J. Climate, 20, 2810–2826.

  • DelSole, T., and J. Shukla, 2006: Specification of wintertime North American surface temperature. J. Climate, 19, 2691–2716.

  • Hastie, T., R. Tibshirani, and J. H. Friedman, 2003: Elements of Statistical Learning. Corrected ed. Springer, 552 pp.

  • Kistler, R., and Coauthors, 2001: The NCEP–NCAR 50-Year Reanalysis: Monthly means CD-ROM and documentation. Bull. Amer. Meteor. Soc., 82, 247–267.

  • Peña, M., and H. van den Dool, 2008: Consolidation of multimodel forecasts by ridge regression: Application to Pacific surface temperature. J. Climate, 21, 6521–6538.

  • Penland, C., and P. D. Sardeshmukh, 1995: The optimal growth of tropical sea surface temperature anomalies. J. Climate, 8, 1999–2024.

  • Phelps, M. W., A. Kumar, and J. J. O'Brien, 2004: Potential predictability in the NCEP CPC dynamical seasonal forecast system. J. Climate, 17, 3775–3785.

  • Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes. Cambridge University Press, 693 pp.

  • Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. Mon. Wea. Rev., 132, 2732–2744.

  • van den Dool, H., and L. Rukhovets, 1994: On the weights for an ensemble-averaged 6–10-day forecast. Wea. Forecasting, 9, 457–465.

  • Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual prediction—Skill and progress beyond DEMETER forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.

  • Yun, W. T., L. Stefanova, and T. N. Krishnamurti, 2003: Improvement of the multimodel superensemble technique for seasonal forecasts. J. Climate, 16, 3834–3840.

  • Zwiers, F. W., 1987: A potential predictability study conducted with an atmospheric general circulation model. Mon. Wea. Rev., 115, 2957–2974.