1. Introduction
Several operational forecast centers issue predictions of weather and climate. The availability of multiple forecasts from different institutions raises the question of whether the forecasts can be combined to increase skill and reliability. Although several multimodel prediction systems have been proposed in the past, the relatively short datasets available for calibration lead to serious problems with overfitting. Consequently, a variety of approaches have been proposed to mitigate overfitting, including truncated singular value decomposition (SVD) analysis, ridge regression, and constrained least squares (Yun et al. 2003; van den Dool and Rukhovets 1994; DelSole 2007). These procedures often are applied on a point-by-point basis.
Recently, it has been recognized that seemingly different approaches to multimodel forecasting are actually special cases of a single Bayesian methodology, each distinguished by different prior assumptions on the model weights (DelSole 2007). This recognition has two important implications. First, it clarifies how each multimodel method fits within a unified framework. Second, it provides a mathematically rigorous and consistent methodology for incorporating prior information into the multimodel strategy. For instance, ridge regression can be interpreted as Bayesian regression under the prior assumption that the sum of squared weights is bounded; truncated SVD analysis can be interpreted as Bayesian regression under the prior assumption that the weights lie in the same space as the leading principal components of the predictors. Other reasonable priors, such as that the weights should be close to the multimodel mean, were suggested and implemented by DelSole (2007).
In this paper, we develop a multimodel regression based on a new prior assumption, namely that the forecast field should be a smooth function of space. This constraint is motivated by the fact that seasonally predictable patterns tend to be large scale. However, even if the forecast fields themselves are smooth, ordinary least squares regression applied to such forecasts typically generates noisy weights and therefore noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. There is no unique way to account for this prior information. For instance, one could apply spatial smoothing to certain steps in a multimodel procedure (Robertson et al. 2004). However, intuitively, it would seem that weights need not be smoothed strongly where the regression fits the data well. Alternatively, one could pool neighboring points to estimate the weight at a single point, as proposed by Peña and van den Dool (2008). This approach is a local linear regression method with a constant kernel (Hastie et al. 2003, chapter 6). One could consider more general kernels that die off smoothly with distance from the target point, but the kernel still would be constant, independent of goodness of fit. The purpose of this paper is to develop a new multimodel prediction system in which the weights are constrained to vary smoothly in space but with the degree of smoothness being balanced against goodness of fit.
2. Statement of the problem









The ordinary least squares solution often yields weights that vary considerably in space and perform relatively poorly in independent data, suggesting problems with overfitting (this will be illustrated in the results section). To mitigate overfitting, one might use as predictors only a small number of the leading principal components of the forecasts. This approach is formally equivalent to that proposed by Yun et al. (2003) and discussed in Press et al. (1992) based on the SVD. Another approach is to apply ridge regression (van den Dool and Rukhovets 1994). Both approaches involve empirical parameters: the SVD approach requires specifying the number of principal components, while ridge regression requires specifying the value of the ridge parameter. These approaches usually are applied at each grid point separately, with no explicit constraint on the spatial structure of the weights.
An important insight is that the above two methods are special cases of a more general Bayesian methodology (DelSole 2007). Specifically, both approaches arise when the errors are Gaussian and the weights are constrained: ridge regression arises when the weights are constrained to have a prespecified L2 norm, while the principal component approach arises when certain combinations of weights are constrained to vanish.
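The two pointwise alternatives can be illustrated with a short sketch on synthetic data. The sample sizes, ridge parameter, and truncation level below are purely illustrative and are not taken from the paper:

```python
import numpy as np

# Toy pointwise multimodel regression: 46 years, 5 model forecasts at one
# grid point (sizes chosen to mimic the ENSEMBLES setting; data are synthetic).
rng = np.random.default_rng(1)
n, m = 46, 5
F = rng.standard_normal((n, m))                     # ensemble-mean forecasts
F[:, 1] = F[:, 0] + 0.1 * rng.standard_normal(n)    # two nearly collinear models
y = F[:, 0] + 0.3 * rng.standard_normal(n)          # verification

# Ordinary least squares: w = (F'F)^{-1} F'y
w_ols = np.linalg.solve(F.T @ F, F.T @ y)

# Ridge regression: w = (F'F + lam I)^{-1} F'y  (lam = 5 is illustrative)
lam = 5.0
w_ridge = np.linalg.solve(F.T @ F + lam * np.eye(m), F.T @ y)

# Principal component (truncated SVD) regression: keep leading k modes only
k = 2
U, s, Vt = np.linalg.svd(F, full_matrices=False)
w_pcr = Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

# Both constrained solutions shrink the weight vector relative to OLS
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge), np.linalg.norm(w_pcr))
```

Both constraints reduce the norm of the weight vector, which is how they stabilize the regression in the presence of collinearity.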
3. Proposed strategy
Although the constraints implicit in the above methods seem reasonable, other prior information may be worth taking into account. Specifically, numerous studies show that seasonally predictable structures tend to be large scale (Zwiers 1987; Penland and Sardeshmukh 1995; Phelps et al. 2004; DelSole and Shukla 2006). This fact suggests that the weights used to combine different model forecasts also should be large scale. Indeed, most experienced forecasters probably would reject a combination whose weights vary strongly in space. Also, numerical theory independently indicates that dynamical models are least reliable at the shortest length scales. Accordingly, rapidly varying weights designed to extract information from small-scale structure in the forecasts would seem to be of dubious value. For these and other reasons, it seems reasonable to assume that the weights should be large scale so that they do not introduce unpredictable small-scale noise. We would like to include this prior information into the regression.
A Bayesian analysis offers the most general framework for incorporating prior information into regression. Unfortunately, this framework involves a cumbersome manipulation of probability distributions. The end result, however, is equivalent to solving a constrained least squares problem, which itself is intuitive and straightforward. Therefore, we describe our multimodel regression strategy in terms of a constrained least squares problem. The proposed strategy is a special case of Tikhonov regularization. For a suitable choice of constraint, the proposed solution reduces to standard ridge regression. However, we propose a constraint that suppresses small-scale variability in the weights more than large-scale variability. Accordingly, we call the proposed approach scale-selective ridge regression.
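The Tikhonov-regularized least squares idea can be sketched in a one-dimensional toy problem: stack the pointwise regressions into a single system and add a penalty λ‖Dw‖², where D takes spatial differences of each model's weight field. The first-difference penalty, the toy sizes, and the value of λ below are our own illustrations standing in for the paper's scale-selective constraint, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 46, 2, 12   # years, models, 1-D grid points (toy sizes)

F = rng.standard_normal((n, m, s))                    # synthetic forecasts
w_true = np.stack([np.sin(np.linspace(0.0, np.pi, s)),
                   np.cos(np.linspace(0.0, np.pi / 2, s))])  # smooth weights
y = np.einsum('njs,js->ns', F, w_true) + 0.5 * rng.standard_normal((n, s))

# Stack the pointwise regressions into one system; unknowns w have shape (m, s)
X = np.zeros((n * s, m * s))
for i in range(n):
    for j in range(m):
        for k in range(s):
            X[i * s + k, j * s + k] = F[i, j, k]
yv = y.reshape(n * s)

# First-difference operator along space, applied to each model's weight field
D1 = np.zeros((s - 1, s))
D1[np.arange(s - 1), np.arange(s - 1)] = -1.0
D1[np.arange(s - 1), np.arange(1, s)] = 1.0
D = np.kron(np.eye(m), D1)

lam = 10.0   # illustrative penalty parameter
w_smooth = np.linalg.solve(X.T @ X + lam * (D.T @ D), X.T @ yv).reshape(m, s)
w_ols = np.linalg.solve(X.T @ X, X.T @ yv).reshape(m, s)

roughness = lambda w: np.sum(np.diff(w, axis=1) ** 2)
print(roughness(w_ols), roughness(w_smooth))  # penalty yields smoother weights
```

The penalized solution is guaranteed to be no rougher (in the ‖Dw‖² sense) than the unpenalized one, at the cost of a slightly larger fitting error; λ controls that trade-off.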
a. Incorporating prior information from physical insight








An attractive property of linear regression is that the associated predictions are invariant to linear transformations of the predictors. This property does not hold for ridge regression. Consequently, many texts on ridge regression recommend standardizing the predictors prior to applying ridge regression. We have compared both approaches and have found that regressions based on standardized forecasts have higher skill than those based on unstandardized forecasts. Accordingly, we present results only for standardized forecasts.
The above solution requires inverting an MS × MS dimensional matrix. Thus, even for moderate resolution (e.g., a 2.5° × 2.5° grid implies S ≈ 10 000), the matrix can be prohibitively large to invert. This numerical limitation should not be construed as fundamental. For instance, conjugate gradient methods probably would be able to solve these equations more efficiently without explicitly forming the matrix.
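As a sketch of that matrix-free idea, a generic conjugate gradient solver needs only the matrix-vector product, never the matrix itself. The toy system below uses a plain ridge penalty for brevity and is our own illustration, not the authors' code:

```python
import numpy as np

def cg_solve(apply_A, b, tol=1e-10, maxiter=500):
    """Conjugate gradients for A x = b, with the symmetric positive definite
    matrix A supplied only through the matrix-vector product apply_A(v)."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy normal equations (X'X + lam I) w = X'y, solved without forming X'X
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 50))
y = rng.standard_normal(200)
lam = 1.0
apply_A = lambda v: X.T @ (X @ v) + lam * v   # matrix-free product

w_cg = cg_solve(apply_A, X.T @ y)
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ y)
print(np.max(np.abs(w_cg - w_direct)))        # agreement with the direct solve
```

Only products with X and Xᵀ are needed, so the MS × MS matrix never has to be stored.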
b. Specifying the constraint








Fig. 1. The penalty function Cp, for p = 2, evaluated for the gravest spherical harmonics as a function of total wavenumber.
Citation: Journal of Climate 26, 20; 10.1175/JCLI-D-13-00030.1
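The caption above notes that Cp grows with total wavenumber. As a sketch only (the paper's exact definition in (18) is not reproduced here, so this is an assumption), a penalty built from the p-th power of the negative Laplacian on the unit sphere scales the coefficient of each spherical harmonic of total wavenumber n by [n(n + 1)]^p, so small scales are damped far more strongly than large ones:

```python
import numpy as np

# Assumed scale-selective penalty: eigenvalues of (-Laplacian)^p on the unit
# sphere for spherical harmonics of total wavenumber n are [n(n+1)]^p.
p = 2
n = np.arange(1, 11)                  # gravest total wavenumbers
C_p = (n * (n + 1.0)) ** p

print(C_p[:4])                        # grows steeply with wavenumber
```

Under this assumption the penalty grows roughly like n^(2p), which is what makes the ridge scale selective rather than uniform.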
c. Selecting the penalty parameter
An outstanding question in the above approach is how to choose the penalty parameter λ. A standard approach is to select penalty parameters to minimize the cross-validated error of the regression model. In leave-one-out cross validation, one sample of the dataset is withheld and the remaining samples are used to fit the model. Then, the resulting model is used to predict the withheld sample. This procedure is repeated using a different withheld sample in turn until all samples have been withheld exactly once.
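Leave-one-out selection of λ can be sketched as follows, here for an ordinary ridge regression on synthetic data; the candidate λ values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 46, 5                                   # years, models (toy data)
X = rng.standard_normal((n, m))
y = X @ np.array([0.4, 0.3, 0.2, 0.1, 0.0]) + 0.8 * rng.standard_normal(n)

def loo_mse(X, y, lam):
    """Leave-one-out cross-validated mean squared error of ridge regression."""
    errs = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # withhold sample i
        Xt, yt = X[keep], y[keep]
        w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(X.shape[1]), Xt.T @ yt)
        errs.append((y[i] - X[i] @ w) ** 2)    # predict the withheld sample
    return float(np.mean(errs))

lams = [0.0, 0.1, 1.0, 10.0, 100.0]            # candidate penalty parameters
scores = {lam: loo_mse(X, y, lam) for lam in lams}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Each sample is withheld exactly once, and the λ with the smallest cross-validated error is selected.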
Selecting the penalty parameter to maximize cross-validated skill leads to artificially inflated estimates of out-of-sample prediction skill. The reason for this selection bias is that random variations in cross-validated skill can be mistaken for real differences in skill. A standard way to avoid this bias is to perform nested cross validation (DelSole 2007). However, this procedure is computationally intensive and obscures interpretation, since no single model and penalty parameter is being tested. On the other hand, our primary goal is to compare different multimodel strategies. Accordingly, we explore a range of penalty parameters. Furthermore, when specific results are illustrated, we select the penalty parameter that maximizes the cross-validated skill for each strategy separately. Even though the cross-validated skill may be biased and there may be considerable uncertainty in the choice of penalty parameter, the comparison at least shows how well each strategy performs in a best-case cross-validation scenario.




4. Data
The dataset used to test the proposed weighting scheme is the seasonal hindcast dataset from the Ensemble-Based Predictions of Climate Changes and Their Impacts (ENSEMBLES) project. This dataset, reviewed by Weisheimer et al. (2009), consists of 7-month hindcasts by five state-of-the-art coupled atmosphere–ocean general circulation models (AOGCMs) from the Met Office, Météo France, European Centre for Medium-Range Weather Forecasts, Leibniz Institute of Marine Sciences at Kiel University, and Euro-Mediterranean Centre for Climate Change in Bologna. All models include major radiative forcings and were initialized using realistic estimates from observations. Hindcasts initialized on the first of February, May, August, and November of each year in the 46-yr period 1960–2005 were examined, but we show results only for November initial conditions. Each model produced a nine-member ensemble hindcast. These were averaged to construct an ensemble-mean forecast for each model. Considering only ensemble-mean forecasts is justified when only the linear combination of forecasts are considered (DelSole 2007). Further details of this data can be found in Weisheimer et al. (2009).
The variable considered in this study is 2-m surface temperature, which was interpolated onto a common 10° × 10° grid. A relatively coarse grid is used to facilitate numerical solution of the scale-selective ridge regression. We examine the 3-month mean hindcasts for November–January (NDJ), initialized in November, primarily because the El Niño–Southern Oscillation (ENSO) signal is expected to be largest during boreal winter. The hindcasts and verifications were centered by subtracting the respective grand means.
The observation-based surface temperature dataset used for verifying the 2-m temperature hindcasts is the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP–NCAR) reanalysis (Kistler et al. 2001).
5. Results
Fig. 2. Weights for the Action de Recherche Petite Echelle Grande Echelle (ARPEGE) model derived from ordinary least squares, scaled multimodel mean, pointwise ridge regression, and scale-selective ridge regression, for predicting NDJ 2-m temperature using the ENSEMBLES hindcasts during 1960–2005. The hindcasts were initialized in November. The power parameter for the scale-selective ridge is p = 2.
Fig. 3. Squared error skill score of cross-validated forecasts from the ordinary least squares regression, scaled multimodel mean model, pointwise ridge regression, and scale-selective regression (p = 2) for predicting NDJ 2-m temperature. The multimodel ensemble consists of five coupled AOGCMs initialized in November during 1960–2005.
The above results illustrate classic symptoms of overfitting—that is, fitting variability that is not reproducible in independent samples (i.e., fitting the noise). Also, collinearity is a potential problem because the forecasts are correlated in ENSO-dominated regions. When overfitting and collinearity occur, the weights become unstable and the model performs worse in independent data than models with fewer parameters. The results also show degeneracy of the type discussed by Barnston and van den Dool (1993), whereby areas with inherently little skill produce negative skill estimates in a leave-one-out cross-validation scheme.
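Skill here is measured by the squared error skill score (SESS). Assuming the conventional definition SESS = 1 − MSE_forecast/MSE_climatology (our assumption, not a formula quoted from this section), negative values of the kind just discussed arise whenever the cross-validated forecast has larger squared error than the climatological forecast:

```python
import numpy as np

def sess(forecast, verification):
    """Squared error skill score, taken here as 1 - MSE/MSE_climatology.
    With centered anomalies, the climatological forecast is zero."""
    mse = np.mean((forecast - verification) ** 2)
    mse_clim = np.mean(verification ** 2)
    return 1.0 - mse / mse_clim

verification = np.array([1.0, -0.5, 0.3, -0.8, 0.6])   # toy anomalies
good = 0.9 * verification     # close to the truth -> SESS near 1
bad = -verification           # worse than climatology -> SESS negative
print(sess(good, verification), sess(bad, verification))
```

A perfect forecast gives SESS = 1, climatology gives 0, and anything worse than climatology is negative, with no lower bound.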
To deal with overfitting and collinearity, we consider several models, including pointwise ridge regression with weights derived from (13) and the scale-selective ridge with weights derived from (12) using the penalty function (18). The cross-validated skill of each model as a function of the ridge parameter λ is shown in Fig. 4. For the scale-selective ridge, we show results only for the power p = 2 [see (15)], as other integer powers had smaller skill. The figure shows that λ = 50 and 10 maximize the skill for scale-selective and pointwise ridge regression, respectively. Recall that λ = 0 corresponds to OLS regression, and λ = ∞ corresponds to the case of a single weight for all models and grid points. Also shown is the skill of a multimodel regression with weights dependent on space but not model (circle) and dependent on model but not space (diagonal cross). The latter two regressions have skill comparable to the best ridge regressions.
Fig. 4. The skill of multimodel hindcasts of NDJ 2-m temperature using scale-selective ridge regression (solid) and pointwise ridge regression (dashed) as a function of the ridge parameter λ. Also shown is the skill of regressions in which the weights depend on space but not model (circle) and depend on model but not space (diagonal cross). Skill is measured by the SESS. Ordinary least squares regression corresponds to λ = 0, and regression using a single weight for all models and space points corresponds to λ = ∞. The power parameter for the scale-selective ridge is p = 2.
In all cases, the scale-selective ridge has larger skill than the pointwise ridge (as reflected by the fact that the solid line lies above the dashed line for all λ in Fig. 4). Whether this difference in skill is statistically significant is difficult to test. We also see that the scaled multimodel mean (far left dot) typically performs as well as, if not better than, the pointwise ridge (as reflected by the fact that the dot on the left side is comparable to, or above, the dashed line). Note also that the skill of either ridge regression is greater than the skill of OLS (as reflected by the fact that the skill at λ = 0 is less than the maximum skill in Fig. 4). Thus, both versions of ridge regression appear to address collinearity and overfitting issues of OLS. The spatial dependence of skill for the two regressions is illustrated in Fig. 3. The most obvious difference is the reduction of negative skill values for the scale-selective ridge compared to other regressions, especially OLS.
The weights for an arbitrarily selected model derived from the scale-selective ridge and the pointwise ridge are shown in Fig. 2. As anticipated, the weights for the scale-selective ridge are smoother than those for OLS (shown in Fig. 2). In fact, the weights for the scale-selective ridge are nearly constant. The extreme case of a single weight for all models at all space points, corresponding to λ = ∞, has skill comparable to the best ridge regressions.
The above calculations were repeated for all available initial months, but the same conclusions hold. Specifically, in each case, the scale-selective ridge had larger cross-validated skill than the pointwise ridge, and regressions with weights that depend on space but not model, or vice versa, had skill comparable to the best ridge regressions. February initial conditions had the largest skill.
6. Summary and discussion
This paper proposed a new approach to linearly combining multimodel forecasts, called the scale-selective ridge, which ensures that the weighting coefficients satisfy a smoothness constraint. This constraint is motivated by the fact that seasonally predictable patterns tend to be large scale. In the absence of a smoothness constraint, regression methods typically produce noisy weights and hence noisy predictions. Constraining the weights to be smooth ensures that the multimodel combination is no less smooth than the individual model forecasts. The weighting coefficients are estimated by a constrained least squares method, which is equivalent to minimizing a cost function comprising the familiar mean square error plus a penalty function that penalizes spatial gradients in the weights. The procedure requires specifying a parameter that controls the strength of the penalty function. The penalty parameter is chosen based on cross-validation experiments. The regression model reduces to pointwise ridge regression for a suitable choice of constraint.
Scale-selective ridge regression was tested with the ENSEMBLES hindcast dataset. Hindcasts initialized in November and validated over the average November–January period during 1960–2005 were examined. The resulting multimodel hindcasts were compared to those produced by pointwise ridge regression, as proposed by van den Dool and Rukhovets (1994). In the case of 2-m temperature, the weights derived from the scale-selective ridge are almost uniform in space. In fact, regressions in which the weights depend on model but not space, or depend on space but not model, had nearly the same skill as the scale-selective ridge. Nevertheless, scale-selective ridge regression yields greater aggregate skill than the pointwise ridge, although the significance of this difference is difficult to test.
The scale-selective ridge is computationally intensive, since the weight at any grid point depends on the data at all other grid points. It is likely that conjugate gradient minimization methods or matrix inversion methods that take advantage of the symmetric Toeplitz structure of the matrices may offer more efficient solutions. Furthermore, the advantages of the scale-selective ridge are likely to become more dramatic in very high resolution data, since pointwise methods are likely to be led astray by the greater observational uncertainties at small scales and the poorer skill of numerically predicted fields at small scales.
Acknowledgments
This research was supported primarily by the National Oceanic and Atmospheric Administration, under the Climate Test Bed program (NA10OAR4310264). Additional support was provided by the National Science Foundation (ATM0332910, ATM0830062, and ATM0830068), National Aeronautics and Space Administration (NNG04GG46G and NNX09AN50G), the National Oceanic and Atmospheric Administration (NA04OAR4310034, NA09OAR4310058, NA05OAR4311004, NA10OAR4310210, and NA10OAR4310249), and Office of Naval Research (N00014-12-1-0911). The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.
REFERENCES
Barnston, A. G., and H. M. van den Dool, 1993: A degeneracy in cross-validated skill in regression-based forecasts. J. Climate, 6, 963–977.
DelSole, T., 2007: A Bayesian framework for multimodel regression. J. Climate, 20, 2810–2826.
DelSole, T., and J. Shukla, 2006: Specification of wintertime North America surface temperature. J. Climate, 19, 2691–2716.
Hastie, T., R. Tibshirani, and J. H. Friedman, 2003: The Elements of Statistical Learning. Corrected ed. Springer, 552 pp.
Kistler, R., and Coauthors, 2001: The NCEP–NCAR 50-Year Reanalysis: Monthly means CD-ROM and documentation. Bull. Amer. Meteor. Soc., 82, 247–267.
Peña, M., and H. van den Dool, 2008: Consolidation of multimodel forecasts by ridge regression: Application to Pacific surface temperature. J. Climate, 21, 6521–6538.
Penland, C., and P. D. Sardeshmukh, 1995: The optimal-growth of tropical sea-surface temperature anomalies. J. Climate, 8, 1999–2024.
Phelps, M. W., A. Kumar, and J. J. O'Brien, 2004: Potential predictability in the NCEP CPC dynamical seasonal forecast system. J. Climate, 17, 3775–3785.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992: Numerical Recipes. Cambridge University Press, 693 pp.
Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. Mon. Wea. Rev., 132, 2732–2744.
van den Dool, H., and L. Rukhovets, 1994: On the weights for an ensemble-averaged 6–10-day forecast. Wea. Forecasting, 9, 457–465.
Weisheimer, A., and Coauthors, 2009: ENSEMBLES: A new multi-model ensemble for seasonal-to-annual prediction—Skill and progress beyond DEMETER forecasting tropical Pacific SSTs. Geophys. Res. Lett., 36, L21711, doi:10.1029/2009GL040896.
Yun, W. T., L. Stefanova, and T. N. Krishnamurti, 2003: Improvement of the multimodel superensemble technique for seasonal forecasts. J. Climate, 16, 3834–3840.
Zwiers, F. W., 1987: A potential predictability study conducted with an atmospheric general circulation model. Mon. Wea. Rev., 115, 2957–2974.