1. Introduction
Given the reality of a changing climate, the demand for reliable and accurate information on expected trends in temperature, precipitation, and other variables is continuously growing. Stakeholders and decision makers in politics, economics, and other societal entities ask for exact numbers on the climate conditions to be expected at specific locations by the middle or end of this century. This demand is contrasted by the cascade of uncertainties that are still inherent in any projection of future climate, ranging from uncertainties in future anthropogenic emissions of greenhouse gases and aerosols (“emission uncertainties”), to uncertainties in physical process understanding and model formulation [“model uncertainties;” e.g., Murphy et al. (2004); Stainforth et al. (2007)], and to uncertainties arising from natural fluctuations [“initial condition uncertainty;” e.g., Lucas-Picher et al. (2008)]. In practice, the quantification of emission uncertainties is typically circumvented by explicitly conditioning climate projections on a range of well-defined emission scenarios (e.g., Nakicenovic and Swart 2000). Initial condition uncertainty is often considered negligible on longer time scales but can, in principle, be sampled by ensemble approaches, as is commonly the case in weather and seasonal forecasting (e.g., Buizza 1997; Kalnay 2003). A pragmatic and well-accepted approach to addressing model uncertainty is given by the concept of multimodel combination (e.g., Tebaldi and Knutti 2007), which is the focus of this paper.
So far, there is no consensus on the best method of combining the output of several climate models. The easiest approach to multimodel combination is to assign one vote to each model (“equal weighting”). More sophisticated approaches suggest assigning different weights to the individual models, with the weights reflecting the respective skill of the models or the confidence we place in them. Proposed metrics as a basis for model weights include the magnitude of observed systematic model biases during the control period (Giorgi and Mearns 2002, 2003; Tebaldi et al. 2005), observed trends (Greene et al. 2006; Hawkins and Sutton 2009; Boé et al. 2009), or composites of a larger number of model performance diagnostics (Murphy et al. 2004).
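To fix ideas, the two combination strategies can be written down in a few lines. The following sketch is purely illustrative; the projection values and weights are hypothetical numbers, not results from any particular model ensemble.

```python
import numpy as np

def multimodel_mean(projections, weights=None):
    """Combine single-model projections into one estimate.
    Equal weighting ("one model, one vote") is used if no weights are given."""
    projections = np.asarray(projections, dtype=float)
    if weights is None:
        weights = np.ones_like(projections)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * projections) / np.sum(weights))

# Hypothetical projected warming (K) from three models:
print(multimodel_mean([2.1, 2.6, 3.4]))                            # equal weights -> 2.70
print(multimodel_mean([2.1, 2.6, 3.4], weights=[0.5, 0.3, 0.2]))   # skill-based weights -> 2.51
```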
Given that, in seasonal forecasting, performance-based weighting schemes have been successfully implemented and have been demonstrated to improve the average prediction skill (e.g., Rajagopalan et al. 2002; Robertson et al. 2004; Stephenson et al. 2005; Weigel et al. 2008b), it may appear obvious that model weighting can also improve projections in a climate change context and reduce the uncertainty range. However, the two projection contexts are not directly comparable. In seasonal forecasting, usually 20–40 yr of hindcasts are available, which mimic real forecasting situations and can thus serve as a data basis for deriving optimum model weights. Even though longer-term climate trends are not appropriately reproduced by seasonal predictions (Liniger et al. 2007), cross-validated verification studies indicate that the climate is nevertheless stationary enough for the time scales considered. Within the context of climate change projections, however, the time scale of the predictand is typically on the order of many decades rather than a couple of months. This strongly limits the number of verification samples that could be used to directly quantify how good a model is at reproducing the climate response to changes in external forcing and, thus, to derive appropriate weights. This situation is aggravated by the fact that existing observations have already been used to calibrate the models. Even more problematic, however, is that we do not know whether the models that perform best during control simulations of past or present climate are those that will perform best in the future. Parameterizations that work well now may become inappropriate in a warmer climate regime. Physical processes, such as carbon cycle feedbacks, that are small now may become highly relevant as the climate changes (e.g., Frame et al. 2007). Given these fundamental problems, it is not surprising that many studies have found only a weak relation between present-day model performance and future projections (Räisänen 2007; Whetton et al. 2007; Jun et al. 2008; Knutti et al. 2010; Scherrer 2010), and only slight persistence of model skill during the past century (Reifen and Toumi 2009). Finally, not even the question of which model performs best during the control simulations can be readily answered; rather, the answer depends strongly on the skill metric, variable, and region considered (e.g., Gleckler et al. 2008). In fact, given that all models have essentially zero weight relative to the real world, Stainforth et al. (2007) go a step further and claim that any attempt to assign weights is, in principle, futile. Whatever one’s personal stance on the issue of model weighting in a climate change context, it seems that at present there is no consensus on how model weights should be obtained, nor is it clear that appropriate weights can be obtained at all with the data and methods at hand.
In this study, we want to shed light on the issue of model weighting from a different perspective, namely from the angle of the expected error of the final outcome. Applying a simple conceptual framework, we attempt to answer the following questions in generic terms: 1) How does simple (unweighted) multimodel combination improve the climate projections? 2) How can the climate projections be further improved by appropriate weights, assuming we knew them? 3) What would the consequences be, in terms of the projection error, if weights were applied that were not representative of true skill? Comparing the potential gains by optimum weighting with the potential losses by “false” weighting, we ultimately want to arrive at a conclusion as to whether or not the application of model weights can be recommended at all at the moment, given the aforementioned uncertainties.
The paper is structured as follows. Section 2 introduces the basis of our analysis, a conceptual framework of climate projections. In section 3, this framework is applied to analyze the expected errors of both optimally and inappropriately weighted multimodels, taking the skill of unweighted multimodels as a benchmark. The impacts of joint model errors and internal variability are estimated. The results are discussed in section 4, and conclusions are provided in section 5.
2. The conceptual framework
a. Basic assumptions

b. Interpretation of the error terms and uncertainties
The quantification of the uncertainties of the error terms νx, νM, and ϵM is a key challenge in the interpretation of climate projections. The uncertainties of νx and νM stem from the high sensitivity of the short-term evolution of the climate system to small perturbations in the initial state and can, in principle, be sampled by ensemble (Stott et al. 2000) or filtering (Hawkins and Sutton 2009) approaches. For simplicity, we assume that both νx and νM follow the same (not necessarily Gaussian) distribution with expectation 0 and standard deviation σν, with the understanding that real climate models can reveal considerable differences in their internal variability (Hawkins and Sutton 2009).
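As a minimal illustration of this sampling idea (with synthetic numbers standing in for actual model output), the spread of an initial-condition ensemble provides an estimate of σν:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for an initial-condition ensemble: the projected
# temperature change (K) of one model, rerun 10 times from perturbed initial states.
delta_y_members = rng.normal(loc=2.0, scale=0.3, size=10)

# The ensemble spread samples the internal-variability term; its standard
# deviation is a (crude, small-sample) estimate of sigma_nu.
sigma_nu_hat = float(delta_y_members.std(ddof=1))
print(round(sigma_nu_hat, 2))
```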
Conceptually much more difficult is the quantification of the uncertainty range of the model error ϵM. Some aspects of the parameter uncertainty may be quantifiable by creating ensembles with varying settings of model parameters (e.g., Allen and Ingram 2002; Murphy et al. 2004). In addition, some aspects of structural uncertainty may at least in principle be quantifiable by systematic experiments. However, given the enormous dimensionality of the uncertainty space, such experiments can at best provide only a first guess of the uncertainty range. Even more problematic is the quantification of the impacts due to limited physical process understanding, that is, the “unknown unknowns” of the climate system.
Unfortunately, the uncertainty characteristics of ϵM cannot simply be sampled in the sense of a robust verification. This is for two reasons: (i) the “sample size problem,” that is, the fact that the long time scales involved reduce our sample size of independent past observations, and (ii) the “out of sample problem,” that is, the fact that any conclusion drawn on the basis of past and present-day observations needs to be extrapolated to climate conditions that have not yet been experienced. Any uncertainty estimate of ϵM is therefore necessarily based on an array of unprovable assumptions and is thus inherently subjective, and potentially volatile. The confidence we put into a climate model reflects our current state of information and belief, but it may change as new information becomes available, or as different experts are in charge of quantifying the uncertainties (Webster 2003). In fact, in a climate change context there is no such thing as “the” uncertainty (Rougier 2007), and consequently it is very difficult to give a reproducible, unique, and objective estimate of expected future model performance. On the shorter time scales of weather and seasonal forecasting, model errors exist equally, but their effects can be empirically quantified by sampling the forecast error statistics over a sufficiently large set of independent verification data (e.g., Raftery et al. 2005; Doblas-Reyes et al. 2005; Weigel et al. 2009). In this way, an objective estimate of the forecast uncertainty, and thus of model quality, is possible; the confidence we put into the accuracy of a model projection is backed up by past measurements of model performance in comparable cases.
Thus, the central conceptual difference between the interpretation of short-range forecasts of weeks and seasons and long-range projections of climate change lies in their different definitions of “uncertainty.” In the former, uncertainty is defined by long series of repeated and reproducible hindcast experiments and thus follows the frequentist (or physical) notion of uncertainty as a measurable quantity. In the latter, uncertainty is partially subjective and depends on prior assumptions as well as expert opinion, thus following the Bayesian notion of uncertainty. It is for exactly this reason that the concept of model weighting, which requires a robust definition of model uncertainty, is relatively straightforward in short-range forecasting but so controversial on climate change time scales.
In the present study we want to analyze the consequences of “correct” and “false” weights on the accuracy of climate projections. However, a weight can only be called correct or false if the underlying uncertainties to be represented by the weights are well defined and uniquely determined. To circumvent this dilemma, we simply assume that enough data were available, or, as Smith (2002) and Stainforth et al. (2007) put it, that we had access to many universes, so that the uncertainty range of ϵM could be fully sampled and defined in a frequentist sense; that is, we assume that enough information was available for the frequentist and Bayesian interpretations of model uncertainty to converge. This uncertainty, denoted by σM, is what we henceforth refer to as the true model uncertainty. We do not know how, or whether at all, the actual value of σM can be sampled in practice, but we assume that σM exists in the sense of a unique physical propensity as defined by Popper (1959). While this assumption may appear disputable, it is indispensable for a discussion of the effects of model weighting. Without the existence of a uniquely determined model error uncertainty, the task of defining optimum weights, and thus the concept of model weighting in general, would be ill-posed in principle.
Finally, we assume (i) that the noise and error terms νx, νM, and ϵM are statistically independent of each other and (ii) that not only νx and νM but also ϵM have expectation 0. Both assumptions may be oversimplifications. The former implies, among other things, that the internal variability of a climate model is not affected by errors in the model formulation. The latter implies that, after removing the effects of internal variability, the expected mean bias of a model during the scenario period is the same as the observed mean bias during the control period (otherwise a nonzero ϵM would be expected). This assumption of “constant biases” has recently been questioned (e.g., Christensen et al. 2008; Buser et al. 2009). Nevertheless, probably for lack of better alternatives, these assumptions have been applied in most published climate projections (e.g., Solomon et al. 2007), and we stick to them here to keep the discussion as simple and transparent as possible.
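Under these independence and zero-mean assumptions, the expected squared error of a single-model projection decomposes as E[(ΔyM − Δx)²] = σM² + 2σν², since the model error and the two noise terms contribute additively. A minimal Monte Carlo sketch of this decomposition, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                                  # Monte Carlo samples
d_mu, sigma_m, sigma_nu = 2.0, 0.6, 0.3      # hypothetical signal and uncertainties (K)

eps_m = rng.normal(0.0, sigma_m, n)          # model error, zero mean
nu_m = rng.normal(0.0, sigma_nu, n)          # internal variability of the model
nu_x = rng.normal(0.0, sigma_nu, n)          # internal variability of the observations

delta_x = d_mu + nu_x                        # "observed" climate change signal
delta_y = d_mu + eps_m + nu_m                # projected climate change signal

mse = float(np.mean((delta_y - delta_x) ** 2))
print(round(mse, 3), sigma_m**2 + 2 * sigma_nu**2)   # ~0.54 vs 0.54
```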
c. Definition of skill

3. The effects of model combination and weights
In this section, we apply the conceptual framework of Eq. (1) to analyze how multimodel combination and model weighting affect the expected error of the projections.
a. Negligible noise, independent model errors








Figure 2 shows, as a function of r, the effects of model averaging with equal weights. Without loss of generality, we only show and discuss r ≤ 1 (i.e., σM2 ≤ σM1). For the moment, we shall ignore the gray lines. Figure 2 shows the expected MSEs of the single models and of the multimodel combinations.
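Under the assumptions of this subsection (negligible noise and independent model errors with uncertainties σM1 and σM2 = r·σM1), the expected squared error of a weighted average w·ΔyM1 + (1 − w)·ΔyM2 is w²σM1² + (1 − w)²σM2², which is minimized by the inverse-variance weight w = σM2²/(σM1² + σM2²). The sketch below evaluates these expressions as a function of r; the uniform-random model for “false” weights is our own illustrative assumption and is not meant to reproduce the weighting experiments shown in the figures.

```python
import numpy as np

sigma_m1 = 1.0                            # reference model error uncertainty (arbitrary units)
r = np.linspace(0.05, 1.0, 20)            # model error ratio sigma_M2 / sigma_M1
sigma_m2 = r * sigma_m1

mse_m1 = np.full_like(r, sigma_m1**2)              # single model M1
mse_m2 = sigma_m2**2                               # single model M2
mse_equal = (sigma_m1**2 + sigma_m2**2) / 4.0      # equal weights (w = 1/2)

w_opt = sigma_m2**2 / (sigma_m1**2 + sigma_m2**2)  # inverse-variance weight on M1
mse_opt = w_opt**2 * sigma_m1**2 + (1 - w_opt)**2 * sigma_m2**2

# "False" weights, modeled here as uniformly random in [0, 1]:
# E[w^2] = E[(1 - w)^2] = 1/3, so the expected error exceeds the equal-weight error.
mse_random = (sigma_m1**2 + sigma_m2**2) / 3.0

# Optimum <= equal <= average single-model error, for every r:
print(bool(np.all(mse_opt <= mse_equal)), bool(np.all(mse_equal <= (mse_m1 + mse_m2) / 2)))
```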













b. The effect of joint model errors






How does all this then affect the expected MSEs of the multimodel outcome? Figure 5 shows, in analogy to Fig. 2, the expected squared errors, but now under the assumptions of this subsection for joint error fractions of j = 0.2, 0.5, and 0.7.
The following conclusions can be drawn from Fig. 5. As j increases, the net skill improvement of the multimodels with respect to the single models decreases, regardless of how the multimodel is constructed. This is plausible, since multimodels can only reduce the independent error components, whose magnitude decreases as j is increased. In relative terms, however, the comparison between the equally weighted, optimally weighted, and falsely weighted multimodels is only weakly affected by the presence of joint errors.
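To make the role of the joint error component explicit: under the assumptions of Fig. 3 (a shared error ϵj with uncertainty σj = j·σM1 plus mutually independent residuals), the expected squared error of w·ΔyM1 + (1 − w)·ΔyM2 is σj² + w²(σM1² − σj²) + (1 − w)²(σM2² − σj²). Minimizing this expression over w gives the weight sketched below, which is how we read Eq. (16); since the equation itself is not reproduced in this excerpt, the sketch should be taken as an interpretation under the stated assumptions.

```python
def w_opt_joint(r, j):
    """Weight on model M1 that minimizes the expected squared error when a
    fraction j = sigma_j / sigma_M1 of the model errors is shared (requires j <= r).
    Variances are expressed in units of sigma_M1**2."""
    res1 = 1.0 - j**2          # residual (independent) error variance of M1
    res2 = r**2 - j**2         # residual (independent) error variance of M2
    return res2 / (res1 + res2)

def mse_weighted_joint(w, r, j):
    """Expected squared error of w*M1 + (1 - w)*M2 in units of sigma_M1**2."""
    return j**2 + w**2 * (1.0 - j**2) + (1 - w)**2 * (r**2 - j**2)

r, j = 0.8, 0.5
w_simplistic = r**2 / (1 + r**2)        # ignores the joint error component
print(round(w_opt_joint(r, j), 3), round(w_simplistic, 3))    # ~0.342 vs ~0.390
print(round(mse_weighted_joint(0.5, r, j), 3),                # equal weighting
      round(mse_weighted_joint(w_opt_joint(r, j), r, j), 3))  # optimum weighting
```

Note that the simplistic weight lies closer to 0.5 than the optimum one; this is the sense in which neglecting the error correlation underestimates the skill difference between the two models.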
In the last part of this section, we now consider the additional effects arising from unpredictable noise. For simplicity, we return to the assumption of independent model errors; that is, j = 0.
c. The effect of unpredictable noise




How does the presence of unpredictable noise then affect the quality of the multimodel projections? Figure 8 shows, in analogy to Figs. 2 and 5, the expected squared errors, now for relative noise ratios of R = 0.5, 1, and 2.
The results can be summarized as follows. As R increases, the difference between the expected errors of the optimally weighted and the equally weighted multimodels decreases, and the optimum weights approach 0.5.
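This behavior can be made explicit. Under the assumptions of Fig. 6 (independent model errors, and noise of magnitude σν = R·σM1 in both models and in the observations), the expected squared error of w·ΔyM1 + (1 − w)·ΔyM2 relative to Δx is w²(σM1² + σν²) + (1 − w)²(σM2² + σν²) + σν², which is minimized by w = (r² + R²)/(1 + r² + 2R²). We read this as the content of Eq. (19), which is not reproduced in this excerpt; the expression recovers the noise-free weight for R = 0 and tends to 0.5 as R grows.

```python
def w_opt_noise(r, R):
    """Weight on M1 that minimizes the expected squared projection error when
    model errors are independent and models and observations carry noise of
    relative magnitude R = sigma_nu / sigma_M1."""
    return (r**2 + R**2) / (1 + r**2 + 2 * R**2)

for R in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(R, round(w_opt_noise(0.5, R), 3))
# R = 0 recovers the noise-free weight r**2 / (1 + r**2) = 0.2;
# as R grows, the optimum weight tends to 0.5 (equal weighting).
```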










Remark 4: In this section we have assumed that the model errors are independent (i.e., that j = 0). However, also in the presence of noise, it is straightforward to generalize the projection context to the situation of j > 0, as in section 3b. In this case, the optimum multimodel mean would converge to (Δμ + ϵj) rather than to Δμ, and the limit of the MSE would be (σj² + σν²) rather than σν². However, even more than in section 3b, the presence of joint errors has only minor implications for the relative performance of the weighted versus unweighted multimodels and will therefore not be discussed further here.
4. Discussion
As all the results presented above are based on a simple conceptual framework, they are only valid to the degree that the underlying assumptions hold. Most likely, the most serious simplification is that emission uncertainty has been ignored entirely. In principle, emission uncertainty could be conceptually included in Eq. (1) by adding an emission scenario error term s to ΔyM, such that ΔyM = Δμ + ϵM + νM + s. However, in a multimodel ensemble, all contributing single models are typically subject to the same emission scenario assumptions and thus the same scenario error s. Therefore, the impacts of emission uncertainty on the relative performance of single models versus multimodels are probably very small. Rather, it is the absolute projection accuracy that would be heavily affected, in that both single-model and multimodel MSEs would be systematically offset by s² with respect to the errors discussed in section 3. This of course has severe consequences for the interpretation of climate projections in general, but it does not affect our discussion of model weights. Apart from the issue of emission uncertainty, the conceptual framework involves many more simplifying assumptions, such as the omission of interaction terms between the different uncertainty sources, as mentioned by Déqué et al. (2007). However, we believe that by explicitly considering the effects of skill difference (via r), model error dependence (via j), and noise (via R), the conceptual framework, despite its simplicity, is realistic enough to allow some generally valid conclusions.
The least surprising conclusion to be drawn is probably that equally weighted multimodel combination on average improves the accuracy of climate projections, a conclusion that is fully consistent with what is known from many verification studies in weather and seasonal forecasting (e.g., Hagedorn et al. 2005; Palmer et al. 2004; Weigel et al. 2008b). Regardless of which values for r, j, and R are chosen, the expected MSE of the multimodel is lower than the average MSE of the participating single models. Moreover, and again consistent with experience from shorter time scales, it has been shown that model weighting can in principle further improve the projections, if properly done. However, this requires accurate knowledge of r, which is the key problem in the context of climate change.
Any estimate of r is necessarily based, to some degree, on the assessment of past and present model performance, and it needs to be extrapolated into the future to be of use for model weighting. In essence, the assumption must be made that r is stationary under a changing climate, which is problematic since other physical processes may become more relevant and dominant in the future than they are now (Knutti et al. 2010). This concern is supported by recent analyses of Christensen et al. (2008) and Buser et al. (2009), who have shown that systematic model errors are likely to change in a warming climate. However, even if r were stationary under a changing climate, we would still be confronted with the problem of how to obtain a robust estimate of r from the available data. In contrast to, say, seasonal forecasting, the multidecadal time scale of the predictand strongly limits the number of independent verification samples that could be used to quantify r. This problem is aggravated by the fact that over the larger part of the past century the anthropogenic climate change signal was relatively weak in comparison to the internal variability. Indeed, Kumar (2009) has shown that for small signal-to-noise ratios (on the order of 0.5), even 25 independent verification samples, a sample size that would already be very large on multidecadal time scales, are hardly enough to obtain statistically robust skill estimates. Attempts have been made to circumvent this sampling issue by estimating model error uncertainties on the basis of other variables that can be verified more easily, such as systematic model biases (e.g., Giorgi and Mearns 2002). However, this leaves open the question of whether such “alternative” variables are representative of a model’s ability to quantify multidecadal climate change signals. Whetton et al. (2007), Jun et al. (2008), Knutti et al. (2010), and other studies show that the correlations between present-day model performance (in terms of such alternative variables) and future changes are in fact weak, and, within the context of monthly forecasting, Weigel et al. (2008a) have shown that the areas with the best bias characteristics are not necessarily those with the highest prediction skill.
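To illustrate the sampling problem, consider the following toy Monte Carlo experiment (it is not Kumar’s (2009) methodology, just a simple stand-in): a predictable signal with a signal-to-noise ratio of 0.5 is verified against synthetic observations using records of only 25 independent samples, and the scatter of the resulting correlation estimates is examined.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_trials = 25, 10_000
snr = 0.5                        # std of the predictable signal / std of the noise

sig = rng.normal(0.0, snr, (n_trials, n_samples))           # predictable signal
obs = sig + rng.normal(0.0, 1.0, (n_trials, n_samples))     # observations = signal + noise
fcst = sig + rng.normal(0.0, 1.0, (n_trials, n_samples))    # forecasts with independent noise

# Sample correlation between forecast and observation for each 25-sample record.
r_hat = np.array([np.corrcoef(f, o)[0, 1] for f, o in zip(fcst, obs)])
print(round(float(r_hat.mean()), 2), np.round(np.percentile(r_hat, [5, 95]), 2))
# The population correlation is snr**2 / (1 + snr**2) = 0.2, but individual
# 25-sample estimates scatter widely, including values near and below zero.
```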
Given all of these fundamental problems in quantifying r, it seems that at the moment there is no consensus on how robust model weights can be derived in the sense of Eq. (9), apart from one exception: if we know a priori that a given model M1 cannot provide a meaningful estimate of future climate while another model M2 can (e.g., because M1 is known to lack key mechanisms that are indispensable for correct climate projections, while M2 includes them), then it may be justifiable to assume that σM2 ≪ σM1 and thus r = 0. For small R, this would then correspond to removing M1 entirely from the multimodel ensemble. In fact, some studies have found more consistent projections when eliminating poor models (e.g., Walsh et al. 2008; Perkins and Pitman 2009; Scherrer 2010). In general, however, model weights bear a high risk of not being representative of the underlying uncertainties. In fact, we believe that the possibility of inadvertently assigning nearly random weights, as analyzed in section 3, is not just an academic thought experiment but a realistic scenario.
Under such conditions, the weighted multimodel yields on average larger errors than if the models had been combined with equal weights. In fact, unless r and R are very small, the potential loss in projection accuracy from applying unrepresentative weights is on average even larger than the potential gain in accuracy from optimum weighting. This aspect, too, has an analog in seasonal forecasting. In an analysis of 2-m temperature forecasts stemming from 40 yr of hindcast data of two seasonal prediction systems, Weigel et al. (2008b) have shown that the equally weighted combination of these two models yields on average higher skill than either of the two single models alone, and that the skill can be further improved if optimum weights are applied (with the optimum weights defined grid point by grid point). However, if the amount of independent training data is systematically reduced, the weight estimates become more uncertain and the average prediction skill drops (see Table 1 for skill values). In fact, if the weights are obtained from less than 20 yr of hindcast data, the weighted multimodel forecasts are outperformed by the equally weighted ones. Particularly low skill is obtained for random weights, as can be seen in Table 1. Note, however, that even the randomly weighted multimodel still outperforms both single models.
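A synthetic analog of this experiment (it does not use the DEMETER data; the error standard deviations 1.0 and 0.8 are hypothetical) illustrates the mechanism: the weight on the first system is estimated by least squares from a limited training record of past errors and then applied out of sample.

```python
import numpy as np

rng = np.random.default_rng(7)
s1, s2 = 1.0, 0.8                  # hypothetical error std devs of the two systems (r = 0.8)
n_test, n_trials = 1_000, 2_000

def out_of_sample_mse(n_train):
    """Mean out-of-sample squared error of the weighted and the equally
    weighted combination, with the weight fitted on n_train past errors."""
    mse_weighted, mse_equal = [], []
    for _ in range(n_trials):
        e1 = rng.normal(0.0, s1, n_train)      # training errors of system 1
        e2 = rng.normal(0.0, s2, n_train)      # training errors of system 2
        denom = np.sum((e1 - e2) ** 2)
        w = np.sum(e2 * (e2 - e1)) / denom if denom > 0 else 0.5
        w = float(np.clip(w, 0.0, 1.0))        # least-squares weight on system 1
        t1 = rng.normal(0.0, s1, n_test)       # independent test errors
        t2 = rng.normal(0.0, s2, n_test)
        mse_weighted.append(np.mean((w * t1 + (1 - w) * t2) ** 2))
        mse_equal.append(np.mean((0.5 * (t1 + t2)) ** 2))
    return round(float(np.mean(mse_weighted)), 3), round(float(np.mean(mse_equal)), 3)

for n in (40, 20, 10, 5):
    print(n, out_of_sample_mse(n))
# With a long training record the fitted weights help slightly; as the record
# shrinks, the weight estimates become noisy and the advantage disappears
# and eventually reverses.
```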
In summary, our results suggest that, within the context of climate change, model combination with equal rather than performance-based weights may well be the safer and more transparent strategy. These arguments are further strengthened if the magnitude of the noise becomes comparable to or even larger than the model error uncertainty, that is, if R ≳ 1. Under these conditions, the optimum weights have been shown to approach 0.5. This means that, for large R, equal weighting is essentially the optimum way to weight the models (see Figs. 8b and 8c), at least if the models to be combined have comparable internal variability. Table 2 provides some rough estimates of R obtained from studies in the literature. While these studies are based on different methods and projection contexts, which can lead to considerably different estimates of R, they all show that R can indeed be large enough that the application of model weights would be of only moderate use, even if the model error ratios were accurately known. This is particularly relevant if variables with low signal-to-noise ratios are considered (e.g., precipitation rather than temperature), if relatively small spatial and temporal aggregations are evaluated (e.g., a 10-yr average over central Europe rather than a 30-yr global average), if the lead times are comparatively short (e.g., 20 yr rather than 100 yr), and if no ensembles are available to sample the uncertainty in the initial conditions.
5. Conclusions
Multimodel combination is a pragmatic and well-accepted technique for estimating the range of uncertainties induced by model error and for improving climate projections. The simplest way to construct a multimodel is to give one vote to each model, that is, to combine the models with equal weights. Since models differ in their quality and prediction skill, it has been suggested that the participating models be weighted according to their prior performance, an approach that has proven successful in weather and seasonal forecasting. In the present study, we have analyzed the prospects and risks of model weighting within the context of multidecadal climate change projections. It has been our aim to arrive at a conclusion as to whether or not the application of model weights can be recommended.
On shorter time scales, such an assessment can be carried out in the form of a statistically robust verification of the predictand of interest. For climate change projections, however, this is hardly possible due to the long time scales involved. Therefore, our study has been based on an idealized framework of climate change projections. This framework has been designed such that it allows us to assess, in generic terms, the effects of multimodel combination independently of the model error magnitudes, the degree of model error correlation, and the amount of unpredictable noise (internal variability). The key results, many of which are consistent with experience from seasonal forecasting, can be summarized as follows:
- Equally weighted multimodels yield, on average, more accurate projections than do the participating single models alone, at least if the skill difference between the single models is not too large.
- The projection errors can, at least in principle, be further reduced by model weighting. The optimum weights are not only a function of the single-model error uncertainties but also depend on the degree of model error correlation and on the relative magnitude of the unpredictable noise. Neglecting the latter two aspects can lead to severely biased estimates of the optimum weights: if model error correlation is neglected, the skill difference between the two models is underestimated; if internal variability is neglected, the skill difference is overestimated.
- Evidence from several studies suggests that the task of finding robust and representative weights for climate models is exceedingly difficult. This is due to (i) the inconveniently long time scales involved, which strongly limit the number of available verification samples; (ii) nonstationarities of model skill under a changing climate; and (iii) the lack of convincing alternative ways to accurately determine skill.
- If model weights are applied that do not reflect the true model error uncertainties, then the weighted multimodel may have much lower skill than the unweighted one. In many cases, more information may actually be lost by inappropriate weighting than can potentially be gained by optimum weighting.
- This asymmetry between the potential loss due to inappropriate weights and the potential gain due to optimum weights grows under the influence of unpredictable noise. In fact, if the noise is comparable in magnitude to, or even larger than, the model errors, then equal weighting essentially becomes the optimum way to construct a multimodel, at least if the models to be combined have similar internal variability. In practice, this is particularly relevant if variables with low signal-to-noise ratios are considered (e.g., precipitation rather than temperature), if high spatial and temporal detail is required, if the lead times are short, and if no ensemble members are available to sample the uncertainty of the initial conditions.
These results do not imply that the derivation of performance-based weights is impossible in principle. In fact, near-term (decadal) climate predictions, such as those planned for the Intergovernmental Panel on Climate Change’s (IPCC) fifth assessment report (Meehl et al. 2009), may contribute significantly to this objective in that they can serve as a valuable test bed for assessing projection uncertainties and characterizing model performance. Moreover, even within the presented framework, eliminating models from an ensemble can be justified if they are known to lack key mechanisms that are indispensable for meaningful climate projections. However, our results do imply that a decision to weight the climate models should be made with the greatest care. Unless there is a clear relation between what we observe and what we predict, the risk of reducing the projection accuracy by inappropriate weights appears to be higher than the prospect of improving it by optimum weights. Given the current difficulties in determining reliable weights, for many applications equal weighting may well be the safer and more transparent way to proceed.
Having said that, the construction of equally weighted multimodels is not trivial, either. In fact, many climate models share basic structural assumptions, process uncertainties, numerical schemes, and data sources, implying that truly equal weights cannot be achieved with a simple “each model one vote” strategy. An even higher level of complexity is reached when climate projections stemming from multiple GCM-driven regional climate models (RCMs) are combined. Very often in such a downscaled scenario context, some of the available RCMs have been driven by the same GCM, while others have been driven by different GCMs (e.g., Van der Linden and Mitchell 2009). Assigning one vote to each model chain may then result in some of the GCMs receiving more weight than others, depending on how many RCMs have been driven by the same GCM.
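The bookkeeping behind this statement is simple. With a hypothetical GCM-RCM matrix (the names below are placeholders), assigning one vote per RCM chain translates into unequal effective weights for the driving GCMs:

```python
from collections import Counter

# Hypothetical model chains: RCM -> driving GCM.
chains = {
    "RCM_A": "GCM_1", "RCM_B": "GCM_1", "RCM_C": "GCM_1",
    "RCM_D": "GCM_2", "RCM_E": "GCM_3",
}

votes_per_gcm = Counter(chains.values())
effective_gcm_weight = {gcm: n / len(chains) for gcm, n in votes_per_gcm.items()}
print(effective_gcm_weight)   # {'GCM_1': 0.6, 'GCM_2': 0.2, 'GCM_3': 0.2}
```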
Given these problems and challenges, model combination with equal weights cannot be considered to be a final solution, either, but rather a starting point for further discussion and research.
Acknowledgments. This study was supported by the Swiss National Science Foundation through the National Centre of Competence in Research (NCCR) Climate and by the ENSEMBLES project (EU FP6, Contract GOCE-CT-2003-505539). Helpful comments by Andreas Fischer are acknowledged.
REFERENCES
Allen, M. R., and W. J. Ingram, 2002: Constraints on future changes in climate and the hydrological cycle. Nature, 419, 224–232.
Boé, J., A. Hall, and X. Qu, 2009: September sea-ice cover in the Arctic Ocean projected to vanish by 2100. Nat. Geosci., 2, 341–343. doi:10.1038/NGEO467.
Buizza, R., 1997: Potential forecast skill of ensemble prediction, and spread and skill distributions of the ECMWF Ensemble Prediction System. Mon. Wea. Rev., 125, 99–119.
Buser, C. M., H. R. Künsch, D. Lüthi, M. Wild, and C. Schär, 2009: Bayesian multimodel projection of climate: Bias assumptions and interannual variability. Climate Dyn., 33, 849–868. doi:10.1007/s00382-009-0588-6.
Christensen, J. H., F. Boberg, O. B. Christensen, and P. Lucas-Picher, 2008: On the need for bias correction of regional climate change projections of temperature and precipitation. Geophys. Res. Lett., 35, L20709. doi:10.1029/2008GL035694.
Cox, P., and D. Stephenson, 2007: A changing climate for prediction. Science, 317, 207–208.
Déqué, M., and Coauthors, 2007: An intercomparison of regional climate simulations for Europe: Assessing uncertainties in model projections. Climatic Change, 81, 53–70.
Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multimodel ensembles in seasonal forecasting. Part II: Calibration and combination. Tellus, 57A, 234–252.
Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.
Frame, D. J., N. E. Faull, M. M. Joshi, and M. R. Allen, 2007: Probabilistic climate forecasts and inductive problems. Philos. Trans. Roy. Soc., 365A, 1971–1992.
Giorgi, F., and L. O. Mearns, 2002: Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the “reliability ensemble averaging” (REA) method. J. Climate, 15, 1141–1158.
Giorgi, F., and L. O. Mearns, 2003: Probability of regional climate change based on the reliability ensemble averaging (REA) method. Geophys. Res. Lett., 30, 1629. doi:10.1029/2003GL017130.
Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. J. Geophys. Res., 113, D06104. doi:10.1029/2007JD008972.
Greene, A. M., L. Goddard, and U. Lall, 2006: Probabilistic multimodel regional temperature change projections. J. Climate, 19, 4326–4343.
Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multimodel ensembles in seasonal forecasting. Part I: Basic concept. Tellus, 57A, 219–233.
Hawkins, E., and R. Sutton, 2009: The potential to narrow uncertainty in regional climate predictions. Bull. Amer. Meteor. Soc., 90, 1095–1107.
Hawkins, E., and R. Sutton, 2010: The potential to narrow uncertainty of regional precipitation change. Climate Dyn., in press, doi:10.1007/s00382-010-0810-6.
Jun, M., R. Knutti, and D. W. Nychka, 2008: Spatial analysis to quantify numerical model bias and dependence: How many climate models are there? J. Amer. Stat. Assoc., 103, 934–947.
Kalnay, E., 2003: Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 341 pp.
Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. J. Climate, 16, 1684–1701.
Knutti, R., 2008: Should we believe model predictions of future climate change? Philos. Trans. Roy. Soc., 366A, 4647–4664.
Knutti, R., R. Furrer, C. Tebaldi, and J. Cermak, 2010: Challenges in combining projections from multiple climate models. J. Climate, 23, 2739–2758.
Kumar, A., 2009: Finite samples and uncertainty estimates for skill measures for seasonal prediction. Mon. Wea. Rev., 137, 2622–2631.
Liniger, M. A., H. Mathis, C. Appenzeller, and F. J. Doblas-Reyes, 2007: Realistic greenhouse gas forcing and seasonal forecasts. Geophys. Res. Lett., 34, L04705. doi:10.1029/2006GL028335.
Lucas-Picher, P., D. Caya, R. de Elía, and R. Laprise, 2008: Investigation of regional climate models’ internal variability with a ten-member ensemble of 10-year simulations over a large domain. Climate Dyn., 31, 927–940.
Meehl, G. A., and Coauthors, 2009: Decadal prediction: Can it be skillful? Bull. Amer. Meteor. Soc., 90, 1467–1485.
Murphy, J. M., D. M. H. Sexton, D. N. Barnett, G. S. Jones, M. J. Webb, M. Collins, and D. A. Stainforth, 2004: Quantification of modelling uncertainties in a large ensemble of climate change simulations. Nature, 430, 768–772.
Nakicenovic, N., and R. Swart, Eds., 2000: Special Report on Emissions Scenarios. A Special Report of Working Group III of the Intergovernmental Panel on Climate Change. Cambridge University Press, 599 pp.
Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85, 853–872.
Perkins, S. E., and A. J. Pitman, 2009: Do weak AR4 models bias projections of future climate change over Australia? Climatic Change, 93, 527–558.
Popper, K. R., 1959: The propensity interpretation of probability. Brit. J. Philos. Sci., 10, 25–42.
Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. Mon. Wea. Rev., 133, 1155–1174.
Räisänen, J., 2007: How reliable are climate models? Tellus, 59A, 2–29.
Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon. Wea. Rev., 130, 1792–1811.
Reifen, C., and R. Toumi, 2009: Climate projections: Past performance no guarantee of future skill? Geophys. Res. Lett., 36, L13704. doi:10.1029/2009GL038082.
Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. Mon. Wea. Rev., 132, 2732–2744.
Rougier, J., 2007: Probabilistic inference for future climate using an ensemble of climate model evaluations. Climatic Change, 81, 247–264.
Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660.
Scherrer, S. C., 2010: Present-day interannual variability of surface climate in CMIP3 models and its relation to future warming. Int. J. Climatol., in press, doi:10.1002/joc.2170.
Smith, L. A., 2002: What might we learn from climate forecasts? Proc. Natl. Acad. Sci. USA, 99, 2487–2492.
Solomon, S., D. Qin, M. Manning, M. Marquis, K. Averyt, M. M. B. Tignor, H. L. Miller Jr., and Z. Chen, Eds., 2007: Climate Change 2007: The Physical Science Basis. Cambridge University Press, 996 pp.
Stainforth, D. A., M. R. Allen, E. R. Tredger, and L. A. Smith, 2007: Confidence, uncertainty and decision-support relevance in climate predictions. Philos. Trans. Roy. Soc., 365A, 2145–2161.
Stephenson, D. B., C. A. S. Coelho, F. J. Doblas-Reyes, and M. Balmaseda, 2005: Forecast assimilation: A unified framework for the combination of multimodel weather and climate predictions. Tellus, 57A, 253–264.
Stott, P. A., S. F. B. Tett, G. S. Jones, M. R. Allen, J. F. B. Mitchell, and G. J. Jenkins, 2000: External control of 20th century temperature by natural and anthropogenic forcings. Science, 290, 2133–2137.
Tebaldi, C., and R. Knutti, 2007: The use of the multimodel ensemble in probabilistic climate projections. Philos. Trans. Roy. Soc., 365A, 2053–2075.
Tebaldi, C., R. L. Smith, D. Nychka, and L. O. Mearns, 2005: Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. J. Climate, 18, 1524–1540.
Uppala, S. M., and Coauthors, 2005: The ERA-40 Re-Analysis. Quart. J. Roy. Meteor. Soc., 131, 2961–3012.
Van der Linden, P., and J. F. B. Mitchell, Eds., 2009: ENSEMBLES: Climate change and its impacts at seasonal, decadal and centennial timescales. Summary of research and results from the ENSEMBLES project. Met Office Hadley Centre, 160 pp. [Available from Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PB, United Kingdom.]
Walsh, J. E., W. L. Chapman, V. Romanovsky, J. H. Christensen, and M. Stendel, 2008: Global climate model performance over Alaska and Greenland. J. Climate, 21, 6156–6174.
Webster, M. D., 2003: Communicating climate change uncertainty to policy-makers and the public. Climatic Change, 61, 1–8.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: Generalization of the discrete Brier and ranked probability skill scores for weighted multimodel ensemble forecasts. Mon. Wea. Rev., 135, 2778–2785.
Weigel, A. P., D. Baggenstos, M. A. Liniger, F. Vitart, and C. Appenzeller, 2008a: Probabilistic verification of monthly temperature forecasts. Mon. Wea. Rev., 136, 5162–5182.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2008b: Can multimodel combination really enhance the prediction skill of ensemble forecasts? Quart. J. Roy. Meteor. Soc., 134, 241–260.
Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2009: Seasonal ensemble forecasts: Are recalibrated single models better than multimodels? Mon. Wea. Rev., 137, 1460–1479.
Whetton, P., I. Macadam, J. Bathols, and J. O’Grady, 2007: Assessment of the use of current climate patterns to evaluate regional enhanced greenhouse response patterns of climate models. Geophys. Res. Lett., 34, L14701. doi:10.1029/2007GL030025.

Fig. 1. The conceptual framework of climate change projections for the assumptions applied in section 3a. Here, Δx is the true climate change signal to be observed in response to a prescribed external forcing, and ΔyM1 and ΔyM2 are two climate change projections obtained from climate models M1 and M2 in response to the same forcing. The deviation of ΔyM1 (ΔyM2) from Δx is assumed to be exclusively due to a model error ϵM1 (ϵM2) with model error uncertainty σM1 (σM2). The two model error terms are statistically independent of each other. The ratio r = σM2/σM1 is referred to as the model error ratio.

Fig. 2. Expected squared errors of single and multimodels as a function of the model error ratio r for the assumptions applied in section 3a and Fig. 1. Shown are the errors of the single models M1 and M2 as well as of the equally and optimally weighted multimodel combinations.

Fig. 3. The conceptual framework for climate change projections for the assumptions applied in section 3b. In contrast to Fig. 1, the deviation of the climate projection ΔyM1 (ΔyM2) from the observation Δx is now thought to be decomposable into two components: (i) an error term ϵj (uncertainty σj), which is jointly seen by both participating climate models M1 and M2, and (ii) a residual error term ϵ′M1 (ϵ′M2). The residual errors are statistically independent of each other. The ratio j = σj/σM1 is referred to as the joint error fraction.

Fig. 4. Optimum weights wopt for the case of dependent model errors and negligible noise as obtained from Eq. (16). The contour lines show wopt as a function of r (model error ratio) and j (joint error fraction). Here, j = 0 corresponds to the case of fully independent model errors. Forbidden combinations of r and j are shaded in gray (by construction, j ≤ r needs to be satisfied).

Fig. 5. As in Fig. 2, but for the assumptions applied in section 3b and Fig. 3: j = (a) 0.2, (b) 0.5, and (c) 0.7, with j being the joint error fraction. Additionally, the expected squared errors of a weighted multimodel with the weights incorrectly determined from Eq. (9) rather than Eq. (16) are shown as triangles (simplistic weights). The top abscissa shows the true optimum weights wopt as obtained from Eq. (16).

Fig. 6. The conceptual framework for climate change projections for the assumptions applied in section 3c. In contrast to Fig. 1, the observed climate change signal Δx is now thought to be decomposable into a model-predictable signal Δμ and an unpredictable “noise” term νx. Similarly, the climate change projection ΔyM1 (ΔyM2) is assumed to be decomposable into the predictable signal Δμ, a model error term ϵM1 (ϵM2), and a random noise term νM1 (νM2). The noise and error terms are statistically independent of each other. All noise terms are assumed to be samples from a distribution with standard deviation σν. The ratio R = σν/σM1 is referred to as the relative noise ratio.

Fig. 7. Optimum weights wopt under the influence of internal variability (“noise”) as obtained from Eq. (19). The contour lines show wopt as a function of r (model error ratio) and R (relative noise ratio). Here, R = 0 corresponds to the case of negligible noise.

Fig. 8. As in Fig. 2, but for the assumptions applied in section 3c and Fig. 6: R = (a) 0.5, (b) 1, and (c) 2, with R being the relative noise ratio. Additionally, the expected squared errors of a weighted multimodel with the weights wrongly determined from Eq. (9) rather than Eq. (19) are shown as triangles (simplistic weights). The top abscissa shows the true optimum weights wopt as obtained from Eq. (19).
Table 1. Average global prediction skill of seasonal forecasts (June–August) of 2-m temperature with a lead time of 1 month, obtained from the Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER) database (Palmer et al. 2004) and verified against 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40) data (Uppala et al. 2005) for the period 1960–2001. Skill is measured by the positively oriented ranked probability skill score (RPSS; Epstein 1969). The verification context is described in detail in Weigel et al. (2008b). Shown is the RPSS for ECMWF’s “System 2” (M1), for the Met Office’s “GloSea” (M2), and for multimodels (MM) constructed from M1 and M2 (i) with equal weights; (ii) with optimum weights obtained grid-point wise from 40, 20, and 10 yr of hindcast data by optimizing the ignorance score of Roulston and Smith (2002); and (iii) with random weights. Skill values are given in percent.

Table 2. Selection of relative noise ratio values (R) as estimated from the literature. Note that different methodologies have been applied in the studies cited.
