## 1. Introduction

Given the reality of a changing climate, the demand for reliable and accurate information on expected trends in temperature, precipitation, and other variables is continuously growing. Stakeholders and decision makers in politics, economics, and other societal entities ask for exact numbers on the climate conditions to be expected at specific locations by the middle or end of this century. This demand is contrasted by the cascade of uncertainties that are still inherent in any projection of future climate, ranging from uncertainties in future anthropogenic emissions of greenhouse gases and aerosols (“emission uncertainties”), to uncertainties in physical process understanding and model formulation [“model uncertainties;” e.g., Murphy et al. (2004); Stainforth et al. (2007)], and to uncertainties arising from natural fluctuations [“initial condition uncertainty;” e.g., Lucas-Picher et al. (2008)]. In practice, the quantification of emission uncertainties is typically circumvented by explicitly conditioning climate projections on a range of well-defined emission scenarios (e.g., Nakicenovic and Swart 2000). Initial condition uncertainty is often considered negligible on longer time scales but can, in principle, be sampled by ensemble approaches, as is commonly the case in weather and seasonal forecasting (e.g., Buizza 1997; Kalnay 2003). A pragmatic and well-accepted approach to addressing model uncertainty is given by the concept of multimodel combination (e.g., Tebaldi and Knutti 2007), which is the focus of this paper.

So far there is no consensus on the best method of combining the output of several climate models. The easiest approach to multimodel combination is to assign one vote to each model (“equal weighting”). Other, more sophisticated approaches suggest assigning different weights to the individual models, with the weights reflecting the respective skill levels of the models or the confidence we put into them. Proposed metrics as a basis for model weights include the magnitude of observed systematic model biases during the control period (Giorgi and Mearns 2002, 2003; Tebaldi et al. 2005), observed trends (Greene et al. 2006; Hawkins and Sutton 2009; Boé et al. 2009), or composites of a larger number of model performance diagnostics (Murphy et al. 2004).

Given that, in seasonal forecasting, performance-based weighting schemes have been successfully implemented and have been demonstrated to improve the average prediction skill (e.g., Rajagopalan et al. 2002; Robertson et al. 2004; Stephenson et al. 2005; Weigel et al. 2008b), it may appear obvious that model weighting can also improve the projections in a climate change context and reduce the uncertainty range. However, the two projection contexts are not directly comparable. In seasonal forecasting, usually 20–40 yr of hindcasts are available, which mimic real forecasting situations and can thus serve as a data basis for deriving optimum model weights. Even though longer-term climate trends are not appropriately reproduced by seasonal predictions (Liniger et al. 2007), cross-validated verification studies indicate that the climate is nevertheless stationary enough for the time scale considered. Within the context of climate change projections, however, the time scale of the predictand is typically on the order of many decades, rather than a couple of months. This strongly limits the number of verification samples that could be used to directly quantify how good a model is in reproducing the climate response to changes in external forcing and, thus, to deriving appropriate weights. This situation is aggravated by the fact that existing observations have already been used to calibrate the models. Even more problematic, however, is that we do not know if those models that perform best during the control simulations of past or present climate are those that will perform best in the future. Parameterizations that work well now may become inappropriate in a warmer climate regime. Physical processes, such as carbon cycle feedbacks, which are small now, may become highly relevant as the climate changes (e.g., Frame et al. 2007). 
Given these fundamental problems, it is not surprising that many studies have found only a weak relation between present-day model performance and future projections (Räisänen 2007; Whetton et al. 2007; Jun et al. 2008; Knutti et al. 2010; Scherrer 2010), and only a slight persistence of model skill during the past century (Reifen and Toumi 2009). Finally, not even the question of which model performs best during the control simulations can be readily answered; rather, the answer depends strongly on the skill metric, variable, and region considered (e.g., Gleckler et al. 2008). In fact, given that all models have essentially zero weight relative to the real world, Stainforth et al. (2007) go a step further and claim that any attempt to assign weights is in principle futile. Whatever one’s personal stance on the issue of model weighting in a climate change context, it seems that at present there is no consensus on how model weights should be obtained, nor is it clear that appropriate weights can be obtained at all with the data and methods at hand.

In this study, we want to shed light on the issue of model weighting from a different perspective, namely from the angle of the expected error of the final outcome. Applying a simple conceptual framework, we attempt to answer the following questions in generic terms: 1) How does simple (unweighted) multimodel combination improve the climate projections? 2) How can the climate projections be further improved by appropriate weights, assuming we knew them? 3) What would the consequences be, in terms of the projection error, if weights were applied that were not representative of true skill? Comparing the potential gains by optimum weighting with the potential losses by “false” weighting, we ultimately want to arrive at a conclusion as to whether or not the application of model weights can be recommended at all at the moment, given the aforementioned uncertainties.

The paper is structured as follows. Section 2 introduces the basis of our analysis, a conceptual framework of climate projections. In section 3, this framework is applied to analyze the expected errors of both optimally and inappropriately weighted multimodels, taking the skill of unweighted multimodels as a benchmark. The impacts of joint model errors and internal variability are estimated. The results are discussed in section 4, and conclusions are provided in section 5.

## 2. The conceptual framework

### a. Basic assumptions

Consider a climate variable *x*, for example, a 30-yr average of surface temperature over a given region, and assume that *x* will change by Δ*x* over a specified time period (e.g., the coming 50 yr). We decompose Δ*x*, the predictand, into the sum of a potentially predictable signal, Δ*μ*, and an unpredictable “noise” term, *ν*_{x}: Δ*x* = Δ*μ* + *ν*_{x}. Thereby, Δ*μ* can be thought of as the *expected* response of the climate to a prescribed change in the external forcing (i.e., the expectation of a hypothetical *perfect* model that is run many times from different initial conditions), while *ν*_{x} represents the remaining fluctuations. Now, assume an *imperfect* climate model *M* is applied to obtain an estimate of Δ*x*. Let Δ*y*_{M} be this estimate, that is, the climate change signal predicted by *M* under a prescribed change in external forcing. Assume that there is no scenario uncertainty, that is, that *M* is subject to the same changes in external forcing as reality. Formally, Δ*y*_{M} can then be decomposed into the sum of the predictable signal Δ*μ*, a random noise term *ν*_{M} (often referred to as internal variability), and a residual error term *ϵ*_{M}. Thus, we have

Δ*y*_{M} = Δ*μ* + *ϵ*_{M} + *ν*_{M}.   (1)

Henceforth, *ϵ*_{M} will be referred to as the model error and can be thought of as a conglomerate of (i) errors due to uncertainties in the model parameters applied to describe unresolvable small-scale physical processes (“parametric uncertainties”), (ii) errors arising from the fact that known processes are missing or inadequately approximated in the model formulation (“structural uncertainty”), and (iii) errors due to our limited understanding of relevant feedbacks and physical processes (“process uncertainties”). A more detailed characterization of these uncertainty terms has been provided by Knutti (2008), among others.

### b. Interpretation of the error terms and uncertainties

The quantification of the uncertainties of the error terms *ν*_{x}, *ν*_{M}, and *ϵ*_{M} is a key challenge in the interpretation of climate projections. The uncertainties of *ν*_{x} and *ν*_{M} stem from the high sensitivity of the short-term evolution of the climate system to small perturbations in the initial state and can, in principle, be sampled by ensemble (Stott et al. 2000) or filtering (Hawkins and Sutton 2009) approaches. For simplicity, we assume that both *ν*_{x} and *ν*_{M} follow the same (not necessarily Gaussian) distribution with expectation 0 and standard deviation *σ*_{ν}, with the understanding that real climate models can reveal considerable differences in their internal variability (Hawkins and Sutton 2009).

Conceptually much more difficult is the quantification of the uncertainty range of the model error *ϵ*_{M}. Some aspects of the parameter uncertainty may be quantifiable by creating ensembles with varying settings of model parameters (e.g., Allen and Ingram 2002; Murphy et al. 2004). In addition, some aspects of structural uncertainty may at least in principle be quantifiable by systematic experiments. However, given the enormous dimensionality of the uncertainty space, such experiments can at best provide only a first guess of the uncertainty range. Even more problematic is the quantification of the impacts due to limited physical process understanding, that is, the “unknown unknowns” of the climate system.

Unfortunately, the uncertainty characteristics of *ϵ*_{M} cannot be simply sampled in the sense of a robust verification. This is for two reasons: (i) the “sample size problem,” that is, the fact that the long time scales involved reduce our sample size of independent past observations, and (ii) the “out of sample problem,” that is, the fact that any conclusion drawn on the basis of past and present-day observations needs to be extrapolated to so far unexperienced climate conditions. Any uncertainty estimate of *ϵ*_{M} is therefore necessarily based on an array of unprovable assumptions and thus is inherently subjective, and volatile. The confidence we put into a climate model reflects our current state of information and belief, but may change as new information becomes available, or as different experts are in charge of quantifying the uncertainties (Webster 2003). In fact, in a climate change context there is no such thing as “the” uncertainty (Rougier 2007), and consequently it is very difficult to give a reproducible, unique, and objective estimate of expected future model performance. On the shorter time scales of weather and seasonal forecasting, model errors exist equally, but their effects can be empirically quantified by sampling the forecast error statistics over a sufficiently large set of independent verification data (e.g., Raftery et al. 2005; Doblas-Reyes et al. 2005; Weigel et al. 2009). In this way, an objective estimate of the forecast uncertainty and thus of model quality is possible; the confidence we put into the accuracy of a model projection is backed up by past measurements of model performance in comparable cases.

Thus, the central conceptual difference between the interpretation of short-range forecasts of weeks and seasons and long-range projections of climate change lies in their different definitions of “uncertainty.” In the former, uncertainty is defined by long series of repeated and reproducible hindcast experiments and thus follows the relative frequentists’ or physical perception of uncertainty, in the sense of a measurable quantity. In the latter, uncertainty is partially subjective and depends on prior assumptions as well as expert opinion, thus following the Bayesian perception of uncertainty. It is for exactly this reason that the concept of model weighting, which requires a robust definition of model uncertainty, is relatively straightforward in short-range forecasting but so controversial on climate change time scales.

In the present study we want to analyze the consequences of “correct” and “false” weights on the accuracy of climate projections. However, a weight can only be called correct or false if the underlying uncertainties to be represented by the weights are well defined and uniquely determined. To circumvent this dilemma, we simply assume that enough data were available, or, as Smith (2002) and Stainforth et al. (2007) put it, that we had access to many universes, so that the uncertainty range of *ϵ*_{M} can be fully sampled and defined in a relative frequentists’ sense; that is, we assume that enough information was available such that the relative frequentists’ and Bayesian interpretations of model uncertainty converge. This uncertainty, denoted by *σ*_{M}, is what we henceforth refer to as the true model uncertainty. We do not know how, or whether at all, the actual value of *σ*_{M} can be sampled in practice, but we assume that *σ*_{M} *exists* in the sense of a unique physical *propensity* as defined by Popper (1959). While this assumption may appear disputable, it is indispensable for a discussion of the effects of model weighting. Without the existence of a uniquely determined model error uncertainty, the task of defining optimum weights, and thus the concept of model weighting in general, would be ill-posed in principle.

Finally, we assume that (i) the noise and error terms *ν*_{x}, *ν*_{M}, and *ϵ*_{M} are statistically independent from each other and (ii) that not only *ν*_{x} and *ν*_{M}, but also *ϵ*_{M}, have expectation 0. Both assumptions may be too simplifying. The former assumption implies, among others, that the internal variability of a climate model is not affected by errors in model formulation. The latter assumption implies that, after removing the effects of internal variability, the *expected* mean bias of a model during the scenario period is the same as the *observed* mean bias during the control period (otherwise a nonzero *ϵ*_{M} would be expected). This assumption of “constant biases” has recently been questioned (e.g., Christensen et al. 2008; Buser et al. 2009). Nevertheless, probably for lack of better alternatives, these assumptions have been applied in most published climate projections (e.g., Solomon et al. 2007), and we will stick to them to keep the discussion as simple and transparent as possible.

### c. Definition of skill

As a measure of the skill of a climate projection obtained from a model *M*, we apply the expected mean squared error (MSE) between Δ*y*_{M} and Δ*x*, henceforth denoted by *E*_{M}:

*E*_{M} = 〈(Δ*y*_{M} − Δ*x*)^{2}〉 = *σ*_{M}^{2} + 2*σ*_{ν}^{2}.   (2)

The brackets 〈…〉 thereby denote the expectation. Since *σ*_{ν} and *σ*_{M} are assumed to be uniquely determined, *E*_{M} is well defined.

## 3. The effects of model combination and weights

In this section, we apply the conceptual framework of Eq. (1) to analyze how *E*_{M} is affected by weighted and unweighted combination of multimodel output. To keep the discussion as transparent as possible, we will restrict ourselves mainly to the combination of only two models. A generalization of the conclusions to more models will not be presented here, but is straightforward by mathematical induction, since the combination of any number of models can be decomposed into a sequence of dual combinations. We start our analysis with the simple and idealized case of fully independent model errors and negligible internal variability, *σ*_{ν} = 0 (section 3a); then we discuss the case when the model errors are not independent (section 3b); and finally we analyze the consequences to be expected if *σ*_{ν} is nonnegligible (section 3c).

### a. Negligible noise, independent model errors

Assume first that the noise terms are negligible (*ν*_{x} = *ν*_{M} = 0). Under these conditions one has Δ*μ* = Δ*x*, implying that the true observable climate change signal Δ*x* is in principle fully predictable. Assume that two climate models, *M*1 and *M*2, are applied and yield climate change projections Δ*y*_{M1} and Δ*y*_{M2}. Let *ϵ*_{M1} and *ϵ*_{M2} be the corresponding projection errors of *M*1 and *M*2 due to model uncertainty. From Eq. (1) it follows that

Δ*y*_{M1} = Δ*x* + *ϵ*_{M1} and Δ*y*_{M2} = Δ*x* + *ϵ*_{M2}.   (3)

This situation is illustrated in Fig. 1. Under these assumptions, the expected squared errors of Δ*y*_{M1} and Δ*y*_{M2} are given by

*E*_{M1} = *σ*_{M1}^{2} and *E*_{M2} = *σ*_{M2}^{2}.   (4)

Combining Δ*y*_{M1} and Δ*y*_{M2} with equal weights yields a simple multimodel projection

Δ*y*_{eq}^{(2)} = (Δ*y*_{M1} + Δ*y*_{M2})/2.   (5)

Given the independence of *ϵ*_{M1} and *ϵ*_{M2}, the expected MSE of this multimodel, *E*_{eq}^{(2)}, is given by

*E*_{eq}^{(2)} = (*σ*_{M1}^{2} + *σ*_{M2}^{2})/4 = *σ*_{M1}^{2}(1 + *r*^{2})/4,   (6)

where *r* = *σ*_{M2}/*σ*_{M1}. Henceforth, *r* will be referred to as the *model error ratio* between *M*2 and *M*1. It quantifies the relative skill difference between *M*1 and *M*2. If *r* = 1, the errors of both models have the same average magnitude, implying that they have equal skill. As *r* gets smaller, the expected error magnitude of *M*2 decreases with respect to *M*1, implying that *M*2 has higher skill than *M*1.
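The idealized two-model case above lends itself to a quick numerical illustration. The following Python sketch (not part of the original analysis; the error magnitudes and the signal value are illustrative choices) simulates two models with independent Gaussian errors and confirms the error cancellation achieved by equal weighting:

```python
import numpy as np

# Monte Carlo sketch (illustrative values, not from the paper): two models
# with independent Gaussian errors and no internal variability. The equally
# weighted multimodel should have expected MSE (sigma_M1^2 + sigma_M2^2)/4,
# which is below the average of the single-model MSEs.
rng = np.random.default_rng(0)
n = 1_000_000
sigma_M1, sigma_M2 = 1.0, 0.6             # model error ratio r = 0.6
dx = 2.0                                   # arbitrary "true" change signal

dy1 = dx + rng.normal(0.0, sigma_M1, n)    # projections of M1
dy2 = dx + rng.normal(0.0, sigma_M2, n)    # projections of M2
dy_eq = 0.5 * (dy1 + dy2)                  # equally weighted multimodel

E_M1 = np.mean((dy1 - dx) ** 2)
E_M2 = np.mean((dy2 - dx) ** 2)
E_eq = np.mean((dy_eq - dx) ** 2)
print(E_M1, E_M2, E_eq)  # approximately 1.0, 0.36, 0.34
```

The sampled multimodel MSE falls below the average single-model MSE, as the derivation above predicts.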

Figure 2 shows, as a function of *r*, the effects of model averaging with equal weights. Without loss of generality, we only show and discuss *r* ≤ 1 (i.e., *σ*_{M2} ≤ *σ*_{M1}). For the moment, we shall ignore the gray lines. Figure 2 shows the expected MSEs *E*_{M1} (thin dotted–dashed line), *E*_{M2} (thin dashed line), and *E*_{eq}^{(2)}, all normalized by *E*_{M1}. It is easy to see that *E*_{eq}^{(2)} ≤ (*E*_{M1} + *E*_{M2})/2 for all *r*; that is, the expected MSE of the combined projection is always lower than the average of the single-model errors, an observation that has also been made in the verification of seasonal multimodel forecasts (e.g., Hagedorn et al. 2005; Palmer et al. 2004; Weigel et al. 2008b). For *r* ≥ 1/√3 ≈ 0.58, that is, when *σ*_{M1} is not too different from *σ*_{M2}, the multimodel error *E*_{eq}^{(2)} is even smaller than *E*_{M2}, the error of the better single model. For *r* < 1/√3, however, a lower MSE would be obtained if only *M*2 was considered rather than the multimodel. Thus, the optimum way of combining the available information is obviously a function of *r*.

The question then arises as to whether weights can be assigned to *M*1 and *M*2 such that the expected multimodel MSE becomes minimal for a given *r*. Consider again the two climate projections Δ*y*_{M1} and Δ*y*_{M2}, which are now combined to a weighted average Δ*y*_{w}^{(2)}:

Δ*y*_{w}^{(2)} = *w*Δ*y*_{M1} + (1 − *w*)Δ*y*_{M2},   (7)

with *w* being the weight of *M*1, and (1 − *w*) being the weight of *M*2. The expected MSE of this weighted multimodel, *E*_{w}^{(2)}, is then given by

*E*_{w}^{(2)} = *w*^{2}*σ*_{M1}^{2} + (1 − *w*)^{2}*σ*_{M2}^{2}.   (8)

Minimizing *E*_{w}^{(2)} on *w* yields as an optimum weight *w*_{opt}:

*w*_{opt} = *σ*_{M2}^{2}/(*σ*_{M1}^{2} + *σ*_{M2}^{2}) = *r*^{2}/(1 + *r*^{2}).   (9)

Note that *w*_{opt} only depends on the error ratio *r*, but not on the absolute values of *σ*_{M1} and *σ*_{M2}. As one would expect, *w*_{opt} approaches 0.5 as *r* gets close to 1. For very large (very small) error ratios, on the other hand, *w*_{opt} approaches 1 (0), implying that all weight is put on *M*1 (*M*2). The values of *w*_{opt} as a function of *r* have been added to Fig. 2 on the upper abscissa. Applying *w*_{opt} in Eq. (8) yields an expression for the optimum expected MSE, *E*_{opt}^{(2)}:

*E*_{opt}^{(2)} = *σ*_{M1}^{2}*σ*_{M2}^{2}/(*σ*_{M1}^{2} + *σ*_{M2}^{2}) = *σ*_{M1}^{2}*r*^{2}/(1 + *r*^{2}).   (10)

This expression as a function of *r* has been included in Fig. 2 (solid gray line), showing that the optimally weighted multimodel clearly outperforms *E*_{M1}, *E*_{M2}, and *E*_{eq}^{(2)} for all values of *r*. Particularly for small values of *r*, that is, when *M*1 and *M*2 are very different in terms of their expected errors, model weighting can indeed strongly improve the projection quality with respect to the benchmark of equal weighting. However, this requires accurate knowledge of *r*, which in practice is very difficult if not impossible to obtain (see discussion in section 4). What then happens in terms of the expected MSE if the models are combined with weights *w*, which may be thought to be optimal, but which in fact do not reflect the true model error ratio? That is, what happens if weights are applied without knowing the true value of *r*? Assuming that it is equally likely that by chance the optimum weight, the worst possible weight, or any other weight *w* ∈ [0, 1] is picked, we introduce the expected MSE of such a randomly weighted multimodel, *E*_{ran}^{(2)}, as the expectation of *E*_{w}^{(2)} over all *w* ∈ [0, 1]:

*E*_{ran}^{(2)} = (*σ*_{M1}^{2} + *σ*_{M2}^{2})/3 = *σ*_{M1}^{2}(1 + *r*^{2})/3.   (11)

This expression has also been included in Fig. 2. It exceeds *E*_{eq}^{(2)} for all values of *r*. In other words, the application of weights that are independent of *r* would on average yield larger errors than if no weights had been applied at all. This conclusion holds for any value of *r*.
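The optimum-weight result can be checked numerically. The sketch below (not from the paper; the values of *r* are illustrative) verifies by brute force that Eq. (9) minimizes the weighted MSE of Eq. (8), and that weights picked independently of *r* are on average worse than equal weighting:

```python
import numpy as np

# Numerical sanity check of the optimum weight of Eq. (9) against a
# brute-force minimization of the weighted MSE of Eq. (8).
def E_w(w, r):
    """Expected weighted MSE of Eq. (8), in units of sigma_M1^2."""
    return w**2 + (1.0 - w)**2 * r**2

w_grid = np.linspace(0.0, 1.0, 100_001)
for r in (0.3, 0.6, 0.9):
    w_num = w_grid[np.argmin(E_w(w_grid, r))]    # brute-force minimum
    w_opt = r**2 / (1.0 + r**2)                   # Eq. (9)
    assert abs(w_num - w_opt) < 1e-4

    E_eq = E_w(0.5, r)                # equal weighting
    E_opt = E_w(w_opt, r)             # optimum weighting
    E_ran = (1.0 + r**2) / 3.0        # average of Eq. (8) over w in [0, 1]
    assert E_opt <= E_eq < E_ran      # uninformed weights cost skill
```

The last assertion reproduces the qualitative conclusion of the text: weights that carry no information about *r* are, on average, worse than applying no weights at all.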

So far, we have assumed that the model errors *ϵ*_{M1} and *ϵ*_{M2} are independent of each other, and that the unpredictable noise *ν*_{M} and *ν*_{x} can be ignored. Under these assumptions, the combination of infinitely many models would eventually cancel out all model errors and yield a perfect climate projection. Indeed, if *m* models are combined with equal weights, and if *m* → ∞, the expected multimodel projection

Δ*y*_{eq}^{(m)} = Δ*x* + (1/*m*) Σ_{i=1}^{m} *ϵ*_{M,i}   (12)

converges to Δ*x*, with *ϵ*_{M,i} being the model error of the *i*th model. Optimally and randomly weighted multimodels can be shown to approach the same limit. The only difference is that the optimally weighted multimodel would converge more quickly than the equally weighted one, while the randomly weighted multimodel would converge more slowly. However, this limit of full error cancellation is not consistent with what has been observed in reality. For example, Knutti et al. (2010) have shown that half of the typical surface temperature biases of climate models would remain, even if an infinite number of models of the same quality were combined. The main reason is probably that different models share similar structural assumptions and in particular share the same unknown unknowns in terms of our physical process understanding, which can lead to correlated errors (e.g., Jun et al. 2008). We are aware that the conclusion of Knutti et al. (2010) refers to an analysis of model mean biases while our discussion focuses on climate projection errors. Nevertheless, their finding illustrates how correlated model errors can influence the effects of model averaging. We therefore now extend our discussion to the situation of joint model errors, that is, model errors that are “seen” by all models, while still ignoring the effects of unpredictable noise.
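The limiting behavior of Eq. (12) can be illustrated by simulation. In this hypothetical setup (all values illustrative), the MSE of the equally weighted mean of *m* independent models decays like 1/*m*:

```python
import numpy as np

# Monte Carlo sketch of the limit in Eq. (12): with independent model
# errors and no noise, the equally weighted mean of m models converges
# to the true signal dx, so its expected MSE behaves like sigma_M^2 / m.
rng = np.random.default_rng(1)
dx, sigma_M, n = 2.0, 1.0, 100_000

mse = {}
for m in (1, 4, 16, 64):
    eps = rng.normal(0.0, sigma_M, (n, m))   # independent model errors
    dy_mm = dx + eps.mean(axis=1)            # equally weighted multimodel
    mse[m] = np.mean((dy_mm - dx) ** 2)      # expected value: sigma_M^2 / m
print(mse)
```

As discussed next, real multimodels do not show this full cancellation, because their errors are partly correlated.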

### b. The effect of joint model errors

Assume that, for each model *M* contributing to the multimodel, the model error *ϵ*_{M} can be decomposed into a joint error contribution *ϵ*_{j}, which is common to all models, and an independent residual error term *ϵ*′_{M}; that is, *ϵ*_{M} = *ϵ*_{j} + *ϵ*′_{M}. For the combination of two models, *M*1 and *M*2, this implies that the predicted climate change signals Δ*y*_{M1} and Δ*y*_{M2} of Eq. (3) and the weighted multimodel projection Δ*y*_{w}^{(2)} of Eq. (7) become

Δ*y*_{M1} = Δ*x* + *ϵ*_{j} + *ϵ*′_{M1},
Δ*y*_{M2} = Δ*x* + *ϵ*_{j} + *ϵ*′_{M2},
Δ*y*_{w}^{(2)} = Δ*x* + *ϵ*_{j} + *w**ϵ*′_{M1} + (1 − *w*)*ϵ*′_{M2}.   (13)

This situation is illustrated in Fig. 3. Note that now the combination of infinitely many models would not converge at Δ*x* as in Eq. (12), but rather at (Δ*x* + *ϵ*_{j}), which is more consistent with the observed behavior of real multimodels. Let *σ*′_{M1}, *σ*′_{M2}, and *σ*_{j} be the underlying uncertainties of *ϵ*′_{M1}, *ϵ*′_{M2}, and *ϵ*_{j}. Assuming mutual independence of *ϵ*′_{M1}, *ϵ*′_{M2}, and *ϵ*_{j}, the expected single-model squared errors *E*_{M1} and *E*_{M2} are given by

*E*_{M1} = *σ*_{j}^{2} + *σ*′_{M1}^{2} and *E*_{M2} = *σ*_{j}^{2} + *σ*′_{M2}^{2},   (14)

and the expected MSE of the weighted multimodel of Eq. (8) becomes

*E*_{w}^{(2)} = *σ*_{j}^{2} + *w*^{2}*σ*′_{M1}^{2} + (1 − *w*)^{2}*σ*′_{M2}^{2} = *σ*_{M1}^{2}{*j*^{2} + *w*^{2}(1 − *j*^{2}) + (1 − *w*)^{2}(*r*^{2} − *j*^{2})},   (15)

where *j* = *σ*_{j}/*σ*_{M1}. Henceforth, *j* will be referred to as the joint error fraction. This term measures the fraction of the root-mean-square error of *M*1 that is equally seen by *M*2. Minimizing Eq. (15) on *w* yields as an expression for a revised optimum weight

*w*_{opt} = (*r*^{2} − *j*^{2})/(1 + *r*^{2} − 2*j*^{2}).   (16)

Figure 4 shows these optimum weights *w*_{opt} as a function of *r* and *j*. Note that *j* ≤ *r* always, since *σ*_{j}, the model error uncertainty jointly seen by both *M*1 and *M*2, cannot be larger than *σ*_{M2}. The contour lines show that, for any *r*, the optimum weight *w*_{opt} of *M*1 decreases as *j* increases. For example, if the model errors *ϵ*_{M1} and *ϵ*_{M2} are fully independent (*j* = 0), an error ratio of *r* = 0.6 corresponds to an optimum weight of approximately 0.26. However, *w*_{opt} would drop to 0.19 if *j* = 0.4, that is, if 40% of the root-mean-square error of *ϵ*_{M1} contributes to the root-mean-square error of *ϵ*_{M2}; and *w*_{opt} would be zero if *j* = *r* = 0.6. In other words, as *j* increases, more weight needs to be assigned to the better one of the two models than if the model errors were fully independent. The reason is that the improvement in skill is only possible by minimizing the contributions of the independent error components rather than the total model errors. That is, the error ratio characterizing the *effective* skill difference between *M*1 and *M*2 is no longer given by (*σ*_{M2}/*σ*_{M1}), but rather by (*σ*′_{M2}/*σ*′_{M1}); this ratio decreases as *j* is increased (for *r* ≤ 1), so the effective skill difference grows. In summary, when the existence of joint model errors is neglected in the formulation of the optimum model weights, the resulting estimates of *w*_{opt} would be implicitly biased. Too little weight would be assigned to the better one of the two models, and too much weight to the poorer one.
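The revised optimum weight can again be verified numerically. The sketch below (illustrative values) minimizes Eq. (15) by brute force and reproduces the example values quoted above (approximately 0.26, 0.19, and 0 for *r* = 0.6 with *j* = 0, 0.4, and 0.6):

```python
import numpy as np

# Brute-force check of the revised optimum weight obtained by minimizing
# the joint-error MSE of Eq. (15), written in units of sigma_M1^2 with
# sigma'_M1^2 = 1 - j^2 and sigma'_M2^2 = r^2 - j^2.
def E_w_joint(w, r, j):
    """Eq. (15) in units of sigma_M1^2."""
    return j**2 + w**2 * (1.0 - j**2) + (1.0 - w)**2 * (r**2 - j**2)

def w_opt_joint(r, j):
    """Closed-form optimum weight derived from Eq. (15)."""
    return (r**2 - j**2) / (1.0 + r**2 - 2.0 * j**2)

w_grid = np.linspace(0.0, 1.0, 100_001)
for r, j in ((0.6, 0.0), (0.6, 0.4), (0.6, 0.6)):
    w_num = w_grid[np.argmin(E_w_joint(w_grid, r, j))]
    assert abs(w_num - w_opt_joint(r, j)) < 1e-4

# the examples quoted in the text:
print(f"{w_opt_joint(0.6, 0.0):.2f} "
      f"{w_opt_joint(0.6, 0.4):.2f} "
      f"{w_opt_joint(0.6, 0.6):.2f}")   # prints: 0.26 0.19 0.00
```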

How does all this then affect the expected MSEs of the multimodel outcome? Figure 5 shows, in analogy to Fig. 2, the expected squared errors *E*_{M1}, *E*_{M2}, *E*_{eq}^{(2)}, *E*_{opt}^{(2)}, and *E*_{ran}^{(2)} for (a) *j* = 0.2, (b) *j* = 0.5, and (c) *j* = 0.7. Also shown is the expected MSE obtained if *r* was accurately known and considered, but the existence of the joint model errors was neglected when calculating *w*_{opt}.

The following conclusions can be drawn from Fig. 5. As *j* increases, the net skill improvement of the multimodels with respect to the single models decreases, regardless of how the multimodel is constructed. This is plausible, since multimodels can only reduce the independent error components, whose magnitude decreases as *j* is increased. In relative terms, the randomly weighted multimodel continues to perform worse than the equally weighted benchmark *E*_{eq}^{(2)}; in fact, it is only for values of *j* ≳ 0.5 that the differences between the weighting strategies become small. If one knew *r* accurately but ignored the effects of joint errors, then the results would only be moderately deteriorated with respect to the optimum weights. In summary, correlated model errors have only a minor impact on the results in section 3a concerning the relative performance of weighted versus unweighted multimodels; however, they have a major impact on the absolute multimodel performance in comparison to the single models.

In the last part of this section, we now consider the additional effects arising from unpredictable noise. For simplicity, we return to the assumption of independent model errors; that is, *j* = 0.

### c. The effect of unpredictable noise

We now analyze the effects of nonnegligible noise on Δ*x*, Δ*y*_{M}, and *E*_{M}. As described in section 2b, we assume that the noise terms *ν*_{x}, *ν*_{M1}, and *ν*_{M2} are independent samples from a distribution with expectation 0 and standard deviation *σ*_{ν}. The situation is illustrated in Fig. 6. The weighted multimodel projection of Eq. (7) then becomes

Δ*y*_{w}^{(2)} = Δ*μ* + *w*(*ϵ*_{M1} + *ν*_{M1}) + (1 − *w*)(*ϵ*_{M2} + *ν*_{M2}),   (17)

with an expected squared error of

*E*_{w}^{(2)} = *w*^{2}*σ*_{M1}^{2} + (1 − *w*)^{2}*σ*_{M2}^{2} + [*w*^{2} + (1 − *w*)^{2}]*σ*_{ν}^{2} + *σ*_{ν}^{2} = *σ*_{M1}^{2}{*w*^{2}(1 + *R*^{2}) + (1 − *w*)^{2}(*r*^{2} + *R*^{2}) + *R*^{2}}.   (18)

Here, *R* = *σ*_{ν}/*σ*_{M1} relates the magnitude of the noise to that of the model error of *M*1 and will henceforth be referred to as the relative noise ratio. Values of *R* > 1 imply that the uncertainties due to noise exceed the model uncertainty, while *R* = 0 corresponds to the situation of negligible noise as considered above in sections 3a and 3b. Minimizing Eq. (18) over *w* yields the following as a revised expression for *w*_{opt}:

*w*_{opt} = (*r*^{2} + *R*^{2})/(1 + *r*^{2} + 2*R*^{2}).   (19)

Figure 7 shows *w*_{opt} as a function of *r* and *R*. The contour lines reveal that, for any *r*, *w*_{opt} increases toward 0.5 as *R* is increased. For instance, if noise is negligible (i.e., *R* = 0), *r* = 0.6 corresponds to *w*_{opt} = 0.26. However, for *R* = 0.5 one has *w*_{opt} = 0.33, while for *R* = 1 one has *w*_{opt} = 0.40, and for *R* → ∞ the optimum weight approaches 0.5 for all *r*. This behavior is plausible, because multimodel combination not only reduces the model errors but also the errors due to noise. Thus, as *R* increases, the optimum compensation of noise errors becomes more and more important for the minimization of the total projection error, and under the assumptions made, the noise errors are optimally reduced by equal weighting. In summary, when the effects of noise are neglected in the formulation of optimum model weights, the resulting estimates of *w*_{opt} are implicitly biased, with the bias growing quickly as *R* becomes larger. The bias is such that too much weight would be given to the better one of the two models, and too little weight to the poorer one.
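As before, the noise-adjusted optimum weight can be checked by brute-force minimization of Eq. (18); the sketch below (illustrative setup) reproduces the example values 0.26, 0.33, and 0.40 quoted above for *r* = 0.6:

```python
import numpy as np

# Brute-force check of the noise-adjusted optimum weight of Eq. (19)
# against a numerical minimization of Eq. (18).
def E_w_noise(w, r, R):
    """Eq. (18) in units of sigma_M1^2, with R = sigma_nu/sigma_M1."""
    return w**2 * (1.0 + R**2) + (1.0 - w)**2 * (r**2 + R**2) + R**2

def w_opt_noise(r, R):
    return (r**2 + R**2) / (1.0 + r**2 + 2.0 * R**2)   # Eq. (19)

w_grid = np.linspace(0.0, 1.0, 100_001)
r = 0.6
for R in (0.0, 0.5, 1.0, 10.0):
    w_num = w_grid[np.argmin(E_w_noise(w_grid, r, R))]
    assert abs(w_num - w_opt_noise(r, R)) < 1e-4

print(f"{w_opt_noise(r, 0.0):.2f} "
      f"{w_opt_noise(r, 0.5):.2f} "
      f"{w_opt_noise(r, 1.0):.2f}")   # prints: 0.26 0.33 0.40
```

The loop also confirms numerically that the weight drifts toward 0.5 as the noise ratio *R* grows.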

How does the presence of unpredictable noise then affect the quality of the multimodel projections? Figure 8 shows, in analogy to Figs. 2 and 5, the expected squared errors *E*_{M1}, *E*_{M2}, *E*_{eq}^{(2)}, *E*_{opt}^{(2)}, and *E*_{ran}^{(2)} for (a) *R* = 0.5, (b) *R* = 1, and (c) *R* = 2. Also shown is the expected MSE obtained if *r* was accurately known and considered, but the noise was neglected when calculating *w*_{opt}.

The results can be summarized as follows. As *R* increases, the difference between *E*_{M1} and *E*_{M2} decreases and the two models become more similar in terms of their net skill, because the individual model error terms *ϵ*_{M1} and *ϵ*_{M2} are diminished in relative importance with respect to the unpredictable noise. At the same time, the range of *r* values for which the equally weighted multimodel outperforms *M*2 (i.e., the better one of the two single models) grows. Indeed, in section 3a it has been noted that, in the absence of unpredictable noise, the equally weighted multimodel outperforms *M*2 only for *r* ≥ 1/√3 ≈ 0.58. If *R* = 0.5, this threshold drops to *r* ≥ 1/√6 ≈ 0.41, and for *R* ≥ 1/√2 ≈ 0.71 the equally weighted multimodel outperforms *M*2 for all *r* ∈ [0, 1]. Taking the equally weighted multimodel as a benchmark, the additional benefit of optimum weighting diminishes as *R* becomes larger. This is simply because *w*_{opt} approaches 0.5 for large *R*, and thus *E*_{opt}^{(2)} approaches *E*_{eq}^{(2)}. Finally, note that the application of simplistic weights derived from Eq. (9) rather than Eq. (19) leads to a massive increase of the MSE with respect to *E*_{opt}^{(2)}, particularly when *r* is small and *R* is on the order of 1 or larger. This illustrates how essential it is that the effects of unpredictable noise be quantified and considered when determining optimum weights. The implications and relevance of these findings will be further discussed in section 4. We finish this section with four remarks.

Remark 1: So far, we have assumed that *ν*_{x}, *ν*_{M1}, and *ν*_{M2} are samples from the same distribution with variance *σ*_{ν}^{2}. However, our conceptual framework can be easily generalized to differing internal variabilities of the two models, *σ*_{ν,M1} and *σ*_{ν,M2}. Defining *R*_{1} = *σ*_{ν,M1}/*σ*_{M1} and *R*_{2} = *σ*_{ν,M2}/*σ*_{M1}, the optimum weight *w*_{opt} of Eq. (19) generalizes to

*w*_{opt} = (*r*^{2} + *R*_{2}^{2})/(1 + *R*_{1}^{2} + *r*^{2} + *R*_{2}^{2}).   (20)

For large internal variabilities, *w*_{opt} then approaches *R*_{2}^{2}/(*R*_{1}^{2} + *R*_{2}^{2}) rather than 0.5 as above.

Remark 2: In section 3a it has been noted that, in the absence of noise, the combination of infinitely many models converges to a perfect projection. Under the presence of noise, if *m* models are combined with equal weights, and if *m* → ∞, the expected multimodel projection

Δ*y*_{eq}^{(m)} = Δ*μ* + (1/*m*) Σ_{i=1}^{m} (*ϵ*_{M,i} + *ν*_{M,i})   (21)

converges to the predictable signal Δ*μ*, but not the actual outcome Δ*x*. This is plausible, because model combination can only cancel out the noise terms *ν*_{M} stemming from internal model variability; the unpredictable noise of the observations, *ν*_{x}, remains. Optimally and randomly weighted multimodels can be shown to approach the same limit. The only difference is that the optimally weighted multimodel converges more quickly than the equally weighted one, while the randomly weighted multimodel converges more slowly.

Remark 3: So far, only one simulation per model has been considered. Now assume that initial-condition ensembles are available. Let *N*_{M1} and *N*_{M2} be the ensemble sizes of *M*1 and *M*2; that is, *N*_{M1} (*N*_{M2}) independent samples of *ν*_{M1} (*ν*_{M2}) are available. Averaging the ensemble members of each model prior to model combination yields the following expected MSEs:

*E*_{M1} = *σ*_{M1}^{2} + *σ*_{ν}^{2}/*N*_{M1} + *σ*_{ν}^{2},   (22)
*E*_{M2} = *σ*_{M2}^{2} + *σ*_{ν}^{2}/*N*_{M2} + *σ*_{ν}^{2}.   (23)

Thus, in comparison to Eq. (2) the contribution of noise to the total projection uncertainty is strongly reduced, but the contribution of model error remains. This has implications for *w*_{opt}, which is now given by

*w*_{opt} = (*r*^{2} + *R*^{2}/*N*_{M2})/(1 + *R*^{2}/*N*_{M1} + *r*^{2} + *R*^{2}/*N*_{M2}).   (24)

If *N*_{M1} and *N*_{M2} become very large, Eq. (24) approaches Eq. (9), that is, the value of *w*_{opt} for negligible noise. In other words, the availability of many ensemble members increases (reduces) the weight to be put on the better (weaker) of the two models—a pattern of behavior that has already been observed and discussed within the context of seasonal forecasting by Weigel et al. (2007).
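A short sketch (illustrative values of *r* and *R*) shows how Eq. (24) interpolates between the noisy single-run weight of Eq. (19) at *N*_{M1} = *N*_{M2} = 1 and the negligible-noise weight of Eq. (9) for large ensembles:

```python
# Ensemble-size effect of Eq. (24): as N_M1 and N_M2 grow, the optimum
# weight approaches the negligible-noise value r^2/(1 + r^2) of Eq. (9).
def w_opt_ens(r, R, N1, N2):
    num = r**2 + R**2 / N2
    return num / (1.0 + R**2 / N1 + num)   # Eq. (24)

r, R = 0.6, 1.0
w9 = r**2 / (1.0 + r**2)                   # Eq. (9), negligible-noise limit

for N in (1, 10, 100, 10_000):
    print(N, round(w_opt_ens(r, R, N, N), 4))
# the printed weights decrease monotonically from the N = 1 value toward w9
```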

Remark 4: In this section we have assumed that the model errors are independent (i.e., that *j* = 0). However, also under the presence of noise, it is straightforward to generalize the projection context to the situation of *j* > 0, as in section 3b. In this case, the optimum multimodel mean would converge to (Δ*μ* + *ϵ*_{j}) rather than Δ*μ*, and the limit of the MSE would be (*σ*_{j}^{2} + *σ*_{ν}^{2}) rather than *σ*_{ν}^{2}. However, even more than in section 3b, the presence of joint errors has only minor implications on the relative performance of the weighted versus unweighted multimodels and will, therefore, not be further discussed here.

## 4. Discussion

As all results presented above are based on a simple conceptual framework, they are as such only valid to the degree that the underlying assumptions hold. Most likely, our most unrealistic assumption is that the emission uncertainty has been entirely ignored. In principle, emission uncertainty could be conceptually included in Eq. (1) by adding an emission scenario error term *s* to Δ*y*_{M}, such that Δ*y*_{M} = Δ*μ* + *ϵ*_{M} + *ν*_{M} + *s*. However, in a multimodel ensemble, all contributing single models are typically subject to the same emission scenario assumptions and thus the same scenario error *s*. Therefore, the impacts of emission uncertainty on the *relative* performance of single models versus multimodels are probably very small. Rather, it is the *absolute* projection accuracy that would be heavily affected, in that both single-model and multimodel MSEs would be systematically offset by *s*^{2} with respect to the errors discussed in section 3. This of course has severe consequences for our interpretation of climate projections in general, but it does not affect our discussion of model weights. Apart from the issue of emission uncertainty, the conceptual framework involves many more simplifying assumptions, such as the omission of interaction terms between the different uncertainty sources, as mentioned by Déqué et al. (2007). However, we believe that by having explicitly considered the effects of skill difference (via *r*), model error dependence (via *j*), and noise (via *R*), the conceptual framework, despite its simplicity, is realistic enough to allow some generally valid conclusions.

The least surprising conclusion to be drawn is probably that equally weighted multimodel combination *on average* improves the reliability of climate projections—a conclusion that is fully consistent with what is known from many verification studies in weather and seasonal forecasting (e.g., Hagedorn et al. 2005; Palmer et al. 2004; Weigel et al. 2008b). Regardless of which values for *r*, *j*, and *R* are chosen, the expected MSE of the multimodel is lower than the average MSE of the participating single models. Moreover, and again consistent with experience from shorter time scales, it has been shown that in principle model weighting can optimize the skill, if properly done. However, this requires an accurate knowledge of *r*—the key problem in the context of climate change.
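This average benefit of equal weighting can be checked with a small Monte Carlo experiment in the paper's conceptual framework (outcome Δ*x* = Δ*μ* + *ν*_{x}; projections Δ*y*_{i} = Δ*μ* + *ϵ*_{i} + *ν*_{i}). The variance values below are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

# Monte Carlo sketch: the equally weighted two-model mean has a lower
# expected MSE than the average of its member models, because the
# independent model errors partially cancel.
rng = np.random.default_rng(0)
n = 200_000
s1, s2, s_nu = 0.6, 1.0, 0.8   # model error SDs and internal variability (assumed)
dmu = 1.5                      # deterministic climate change signal (assumed)

x  = dmu + rng.normal(0.0, s_nu, n)                           # actual outcome
y1 = dmu + rng.normal(0.0, s1, n) + rng.normal(0.0, s_nu, n)  # projection of M1
y2 = dmu + rng.normal(0.0, s2, n) + rng.normal(0.0, s_nu, n)  # projection of M2

def mse(y):
    return float(np.mean((y - x) ** 2))

mse_singles = 0.5 * (mse(y1) + mse(y2))   # average single-model error
mse_equal   = mse(0.5 * (y1 + y2))        # equally weighted multimodel
print(mse_equal, "<", mse_singles)
```

Analytically, the multimodel MSE here is 0.25(*σ*_{1}^{2} + *σ*_{2}^{2}) plus the irreducible noise terms, which is always below the single-model average of 0.5(*σ*_{1}^{2} + *σ*_{2}^{2}) plus the same noise.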

Any estimate of *r* is to some degree necessarily based on the assessment of past and present model performance, and it needs to be extrapolated into the future to be of use for model weighting. In essence, the assumption must be made that *r* is stationary under a changing climate, which is problematic since other physical processes may become more relevant and dominant in the future than they are now (Knutti et al. 2010). This apprehension is backed by recent analyses of Christensen et al. (2008) and Buser et al. (2009), who have shown that systematic model errors are likely to change in a warming climate. However, even if *r* were stationary under a changing climate, we would still be confronted with the problem of how to determine a robust estimate of *r* from the available data. In contrast to, say, seasonal forecasting, the multidecadal time scale of the predictand strongly limits the number of independent verification samples that could be used to quantify *r*. This problem is aggravated by the fact that over the larger part of the past century, the anthropogenic climate change signal was relatively weak in comparison to the internal variability. Indeed, Kumar (2009) has shown that for small signal-to-noise ratios (on the order of 0.5), even 25 independent verification samples, a sample size that would actually be very large on multidecadal time scales, are hardly enough to obtain statistically robust skill estimates. Attempts have been made to circumvent this sampling issue by estimating model error uncertainties on the basis of other variables that can be verified more easily, such as systematic model biases (e.g., Giorgi and Mearns 2002). However, this leaves open the question of whether such “alternative” variables are representative of a model’s ability to quantify multidecadal climate change signals. Whetton et al. (2007), Jun et al. (2008), Knutti et al. (2010), and other studies show that the correlations between present-day model performance (in terms of such alternative variables) and future changes are in fact weak, and within the context of monthly forecasting, Weigel et al. (2008a) have shown that the areas with the best bias characteristics are not necessarily those with the highest monthly prediction skill.

Given all of these fundamental problems in quantifying *r*, it seems that at the moment there is no consensus on how robust model weights can be derived in the sense of Eq. (9)—apart from one exception: if we *know* a priori that a given model *M*1 *cannot* provide a meaningful estimate of future climate while another model *M*2 *can* (e.g., because *M*1 is known to lack important key mechanisms that are indispensable for correct climate projections, while *M*2 has them included), then it may be justifiable to assume that *σ*_{M2} ≪ *σ*_{M1} and thus *r* = 0. For small *R*, this corresponds to removing *M*1 entirely from the multimodel ensemble. In fact, some studies have found more consistent projections when eliminating poor models (e.g., Walsh et al. 2008; Perkins and Pitman 2009; Scherrer 2010). In general, however, model weights bear a high risk of not being representative of the underlying uncertainties. In fact, we believe that the possibility of inadvertently assigning nearly random weights, as analyzed in section 3, is not just an academic thought experiment but a realistic scenario.

Under such conditions, the weighted multimodel yields on average larger errors than if the models had been combined with equal weights. In fact, unless *r* and *R* are very small, the potential loss in projection accuracy from applying unrepresentative weights is on average even larger than the potential gain in accuracy from optimum weighting. This aspect, too, has its equivalent in seasonal forecasting. In an analysis of 2-m temperature forecasts from 40 yr of hindcast data of two seasonal prediction systems, Weigel et al. (2008b) have shown that the equally weighted combination of the two models yields on average higher skill than either single model alone, and that the skill can be further improved if optimum weights are applied (the optimum weights being defined gridpoint by gridpoint). However, if the amount of independent training data is systematically reduced, the weight estimates become more uncertain and the average prediction skill drops (see Table 1 for skill values). In fact, if the weights are obtained from less than 20 yr of hindcast data, weighted multimodel forecasts are outperformed by equally weighted ones. Particularly low skill is obtained for random weights, as can be seen in Table 1. Note, however, that even the randomly weighted multimodel still outperforms both single models.
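The asymmetry between the potential gain from optimum weights and the potential loss from unrepresentative (here: random) weights can be illustrated in the same conceptual framework. The parameter values are again assumptions chosen for illustration; for two models of similar quality, random weights cost far more than optimum weights gain.

```python
import numpy as np

# Monte Carlo sketch of the gain/loss asymmetry discussed above.
rng = np.random.default_rng(1)
n = 100_000
s1, s2, s_nu = 0.8, 1.0, 0.3   # two models of similar quality (assumed values)

x  = rng.normal(0.0, s_nu, n)                             # outcome anomaly
y1 = rng.normal(0.0, s1, n) + rng.normal(0.0, s_nu, n)    # M1 anomaly
y2 = rng.normal(0.0, s2, n) + rng.normal(0.0, s_nu, n)    # M2 anomaly

def mse(w):
    """Mean squared error of the weighted mean w*y1 + (1-w)*y2."""
    return float(np.mean((w * y1 + (1.0 - w) * y2 - x) ** 2))

# Analytic optimum weight for independent errors and shared noise level:
w_best = (s2**2 + s_nu**2) / (s1**2 + s2**2 + 2.0 * s_nu**2)
mse_opt, mse_equal = mse(w_best), mse(0.5)
mse_rand = float(np.mean([mse(w) for w in rng.random(500)]))  # random weights

gain = mse_equal - mse_opt    # what optimum weighting buys
loss = mse_rand - mse_equal   # what random weighting costs
print(gain, loss)
```

With these settings the loss from random weights exceeds the gain from optimum weights by roughly an order of magnitude, mirroring the argument in the text.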

In summary, our results suggest that, within the context of climate change, model combination with equal rather than performance-based weights may well be the safer and more transparent strategy to obtain optimum results. These arguments are further strengthened if the magnitude of the noise becomes comparable to or even larger than the model error uncertainty; that is, if *R* ≳ 1. Under these conditions, the optimum weights have been shown to approach 0.5. This means, for large *R*, equal weighting essentially *is* the optimum way to weight the models (see Figs. 8b and 8c), at least if the models to be combined have comparable internal variability. Table 2 provides some rough estimates of *R* obtained from other studies found in the literature. While these studies are based on different methods and projection contexts, which can lead to considerably different estimates of *R*, they all show that *R* can indeed be large enough so that the application of model weights would only be of moderate use, even if the model error ratios were accurately known. This is particularly relevant if variables with low signal-to-noise ratios are considered (e.g., precipitation rather than temperature), if relatively small spatial and temporal aggregations are evaluated (e.g., a 10-yr average over central Europe rather than a 30-yr global average), if the lead times are comparatively short (e.g., 20 yr rather than 100 yr), and if no ensembles are available to sample the uncertainty in the initial conditions.
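The convergence of the optimum weight toward 0.5 for large noise can be made concrete. The snippet below uses an assumed parameterization (*r* = *σ*_{1}/*σ*_{2} for the model error ratio, *R* = *σ*_{ν}/*σ*_{2} for the noise scaled by the weaker model's error) that is consistent with, but simpler than, the paper's formulas.

```python
# Sketch: dependence of the optimum weight on the relative noise ratio R,
# for two models with independent errors and equal internal variability.

def w_opt(r, R):
    """Weight on the better model M1; r = sigma1/sigma2 <= 1,
    R = sigma_nu/sigma2 (an assumed normalization)."""
    v1, v2 = r**2 + R**2, 1.0 + R**2   # total variances of the two models
    return v2 / (v1 + v2)

for R in (0.0, 1.0, 3.0, 10.0):
    print(R, round(w_opt(0.5, R), 3))
```

For *r* = 0.5 the weight drops from 0.8 in the noise-free case toward 0.5 as *R* grows, so for *R* ≳ 1 little is left to be gained from weighting.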

## 5. Conclusions

Multimodel combination is a pragmatic and well-accepted technique to estimate the range of uncertainties induced by model error and to improve climate projections. The simplest way to construct a multimodel is to give one vote to each model, that is, to combine the models with equal weights. Since models differ in their quality and prediction skill, it has been suggested to weight the participating models according to their prior performance, an approach that has proven successful in weather and seasonal forecasting. In the present study, we have analyzed the prospects and risks of model weighting within the context of multidecadal climate change projections. It has been our aim to arrive at a conclusion as to whether or not the application of model weights can be recommended.

On shorter time scales, such an assessment can be carried out in the form of a statistically robust verification of the predictand of interest. For climate change projections, however, this is hardly possible due to the long time scales involved. Therefore, our study has been based on an idealized framework of climate change projections. This framework has been designed such that it allows us to assess, in generic terms, the effects of multimodel combination independently of the model error magnitudes, the degree of model error correlation, and the amount of unpredictable noise (internal variability). The key results, many of which are consistent with experience from seasonal forecasting, can be summarized as follows:

- Equally weighted multimodels yield, on average, more accurate projections than do the participating single models alone, at least if the skill difference between the single models is not too large.
- The projection errors can be further reduced by model weighting, at least in principle. The optimum weights are thereby not only a function of the single model error uncertainties, but also depend on the degree of model error correlation and the relative magnitude of the unpredictable noise. Neglecting the latter two aspects can lead to severely biased estimates of optimum weights. If model error correlation is neglected, the skill difference between the two models is underestimated; if internal variability is neglected, the skill difference is overestimated.
- Evidence from several studies suggests that the task of finding robust and representative weights for climate models is certainly a difficult problem. This is due to (i) the inconveniently long time scales considered, which strongly limit the number of available verification samples; (ii) nonstationarities of model skill under a changing climate; and (iii) the lack of convincing alternative ways to accurately determine skill.
- If model weights are applied that do not reflect the true model error uncertainties, then the weighted multimodel may have much lower skill than the unweighted one. In many cases, more information may actually be lost by inappropriate weighting than can potentially be gained by optimum weighting.
- This asymmetry between potential loss due to inappropriate weights and potential gain due to optimum weights grows under the influence of unpredictable noise. In fact, if the noise is of comparable or even larger magnitude than the model errors, then equal weighting essentially becomes the optimum way to construct a multimodel, at least if the models to be combined have similar internal variability. In practice, this is particularly relevant if variables with low signal-to-noise ratios are considered (e.g., precipitation rather than temperature), if high spatial and temporal detail is required, if the lead times are short, and if no ensemble members are available to sample the uncertainty of the initial conditions.

These results do not imply that the derivation of performance-based weights is impossible in principle. In fact, near-term (decadal) climate predictions, such as those planned for the Intergovernmental Panel on Climate Change’s (IPCC) fifth assessment report (Meehl et al. 2009), may contribute significantly to this objective in that they can serve as a valuable test bed for assessing projection uncertainties and characterizing model performance. Moreover, even within the presented framework, eliminating models from an ensemble can be justified if they are known to lack key mechanisms that are indispensable for meaningful climate projections. However, our results do imply that a decision to weight the climate models should be made with the greatest care. Unless there is a clear relation between what we observe and what we predict, the risk of reducing the projection accuracy by inappropriate weights appears to be higher than the prospect of improving it by optimum weights. Given the current difficulties in determining reliable weights, for many applications equal weighting may well be the safer and more transparent way to proceed.

Having said that, the construction of equally weighted multimodels is not trivial, either. In fact, many climate models share basic structural assumptions, process uncertainties, numerical schemes, and data sources, implying that with a simple “each model one vote” strategy truly equal weights cannot be accomplished. An even higher level of complexity is reached when climate projections are combined that stem from multiple GCM-driven regional climate models (RCMs). Very often in such a downscaled scenario context, some of the available RCMs have been driven by the same GCM, while others have been driven by different GCMs (e.g., Van der Linden and Mitchell 2009). Assigning one vote to each model chain may then result in some of the GCMs receiving more weight than others, depending on how many RCMs have been driven by the same GCM.
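The vote-counting problem for GCM-driven RCM chains can be made explicit with a toy example (the model names below are hypothetical): assigning one vote per model chain gives a GCM that drives several RCMs a correspondingly larger implicit weight.

```python
from collections import Counter

# Sketch of the GCM-RCM weighting issue: one vote per chain implicitly
# upweights GCMs that drive many RCMs. The chain list is hypothetical.
chains = [("GCM-A", "RCM-1"), ("GCM-A", "RCM-2"), ("GCM-A", "RCM-3"),
          ("GCM-B", "RCM-4"), ("GCM-C", "RCM-5")]

votes = Counter(gcm for gcm, _ in chains)
n = len(chains)
implicit_gcm_weight = {g: c / n for g, c in votes.items()}
print(implicit_gcm_weight)
```

Here GCM-A, driving three of the five chains, carries 60% of the total weight, even though each chain nominally received "one vote."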

Given these problems and challenges, model combination with equal weights cannot be considered to be a final solution, either, but rather a starting point for further discussion and research.

## Acknowledgments

This study was supported by the Swiss National Science Foundation through the National Centre for Competence in Research (NCCR) Climate and by the ENSEMBLES project (EU FP6, Contract GOCE-CT-2003-505539). Helpful comments of Andreas Fischer are acknowledged.

## REFERENCES

Allen, M. R., and W. J. Ingram, 2002: Constraints on future changes in climate and the hydrological cycle. *Nature*, **419**, 224–232.

Boé, J., A. Hall, and X. Qu, 2009: September sea-ice cover in the Arctic Ocean projected to vanish by 2100. *Nat. Geosci.*, **2**, 341–343, doi:10.1038/NGEO467.

Buizza, R., 1997: Potential forecast skill of ensemble prediction, and spread and skill distributions of the ECMWF Ensemble Prediction System. *Mon. Wea. Rev.*, **125**, 99–119.

Buser, C. M., H. R. Künsch, D. Lüthi, M. Wild, and C. Schär, 2009: Bayesian multimodel projection of climate: Bias assumptions and interannual variability. *Climate Dyn.*, **33**, 849–868, doi:10.1007/s00382-009-0588-6.

Christensen, J. H., F. Boberg, O. B. Christensen, and P. Lucas-Picher, 2008: On the need for bias correction of regional climate change projections of temperature and precipitation. *Geophys. Res. Lett.*, **35**, L20709, doi:10.1029/2008GL035694.

Cox, P., and D. Stephenson, 2007: A changing climate for prediction. *Science*, **317**, 207–208.

Déqué, M., and Coauthors, 2007: An intercomparison of regional climate simulations for Europe: Assessing uncertainties in model projections. *Climatic Change*, **81**, 53–70.

Doblas-Reyes, F. J., R. Hagedorn, and T. N. Palmer, 2005: The rationale behind the success of multimodel ensembles in seasonal forecasting. Part II: Calibration and combination. *Tellus*, **57A**, 234–252.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Frame, D. J., N. E. Faull, M. M. Joshi, and M. R. Allen, 2007: Probabilistic climate forecasts and inductive problems. *Philos. Trans. Roy. Soc.*, **365A**, 1971–1992.

Giorgi, F., and L. O. Mearns, 2002: Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the “reliability ensemble averaging” (REA) method. *J. Climate*, **15**, 1141–1158.

Giorgi, F., and L. O. Mearns, 2003: Probability of regional climate change based on the reliability ensemble averaging (REA) method. *Geophys. Res. Lett.*, **30**, 1629, doi:10.1029/2003GL017130.

Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. *J. Geophys. Res.*, **113**, D06104, doi:10.1029/2007JD008972.

Greene, A. M., L. Goddard, and U. Lall, 2006: Probabilistic multimodel regional temperature change projections. *J. Climate*, **19**, 4326–4343.

Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multimodel ensembles in seasonal forecasting. Part I: Basic concept. *Tellus*, **57A**, 219–233.

Hawkins, E., and R. Sutton, 2009: The potential to narrow uncertainty in regional climate predictions. *Bull. Amer. Meteor. Soc.*, **90**, 1095–1107.

Hawkins, E., and R. Sutton, 2010: The potential to narrow uncertainty of regional precipitation change. *Climate Dyn.*, in press, doi:10.1007/s00382-010-0810-6.

Jun, M., R. Knutti, and D. W. Nychka, 2008: Spatial analysis to quantify numerical model bias and dependence: How many climate models are there? *J. Amer. Stat. Assoc.*, **103**, 934–947.

Kalnay, E., 2003: *Atmospheric Modeling, Data Assimilation and Predictability*. Cambridge University Press, 341 pp.

Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. *J. Climate*, **16**, 1684–1701.

Knutti, R., 2008: Should we believe model predictions of future climate change? *Philos. Trans. Roy. Soc.*, **366A**, 4647–4664.

Knutti, R., R. Furrer, C. Tebaldi, and J. Cermak, 2010: Challenges in combining projections from multiple climate models. *J. Climate*, **23**, 2739–2758.

Kumar, A., 2009: Finite samples and uncertainty estimates for skill measures for seasonal prediction. *Mon. Wea. Rev.*, **137**, 2622–2631.

Liniger, M. A., H. Mathis, C. Appenzeller, and F. J. Doblas-Reyes, 2007: Realistic greenhouse gas forcing and seasonal forecasts. *Geophys. Res. Lett.*, **34**, L04705, doi:10.1029/2006GL028335.

Lucas-Picher, P., D. Caya, R. de Elía, and R. Laprise, 2008: Investigation of regional climate models’ internal variability with a ten-member ensemble of 10-year simulations over a large domain. *Climate Dyn.*, **31**, 927–940.

Meehl, G. A., and Coauthors, 2009: Decadal prediction: Can it be skillful? *Bull. Amer. Meteor. Soc.*, **90**, 1467–1485.

Murphy, J. M., D. M. H. Sexton, D. N. Barnett, G. S. Jones, M. J. Webb, M. Collins, and D. A. Stainforth, 2004: Quantification of modelling uncertainties in a large ensemble of climate change simulations. *Nature*, **430**, 768–772.

Nakicenovic, N., and R. Swart, Eds., 2000: *Special Report on Emissions Scenarios. A Special Report of Working Group III of the Intergovernmental Panel on Climate Change*. Cambridge University Press, 599 pp.

Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Perkins, S. E., and A. J. Pitman, 2009: Do weak AR4 model bias projections of future climate change over Australia? *Climatic Change*, **93**, 527–558.

Popper, K. R., 1959: The propensity interpretation of probability. *Brit. J. Philos. Sci.*, **10**, 25–42.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian model averaging to calibrate forecast ensembles. *Mon. Wea. Rev.*, **133**, 1155–1174.

Räisänen, J., 2007: How reliable are climate models? *Tellus*, **59A**, 2–29.

Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. *Mon. Wea. Rev.*, **130**, 1792–1811.

Reifen, C., and R. Toumi, 2009: Climate projections: Past performance no guarantee of future skill? *Geophys. Res. Lett.*, **36**, L13704, doi:10.1029/2009GL038082.

Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. *Mon. Wea. Rev.*, **132**, 2732–2744.

Rougier, J., 2007: Probabilistic inference for future climate using an ensemble of climate model evaluations. *Climatic Change*, **81**, 247–264.

Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. *Mon. Wea. Rev.*, **130**, 1653–1660.

Scherrer, S. C., 2010: Present-day interannual variability of surface climate in CMIP3 models and its relation to future warming. *Int. J. Climatol.*, in press, doi:10.1002/joc.2170.

Smith, L. A., 2002: What might we learn from climate forecasts? *Proc. Natl. Acad. Sci. USA*, **99**, 2487–2492.

Solomon, S., D. Qin, M. Manning, M. Marquis, K. Averyt, M. M. B. Tignor, H. L. Miller Jr., and Z. Chen, Eds., 2007: *Climate Change 2007: The Physical Science Basis*. Cambridge University Press, 996 pp.

Stainforth, D. A., M. R. Allen, E. R. Tredger, and L. A. Smith, 2007: Confidence, uncertainty and decision-support relevance in climate predictions. *Philos. Trans. Roy. Soc.*, **365A**, 2145–2161.

Stephenson, D. B., C. A. S. Coelho, F. J. Doblas-Reyes, and M. Balmaseda, 2005: Forecast assimilation: A unified framework for the combination of multimodel weather and climate predictions. *Tellus*, **57A**, 253–264.

Stott, P. A., S. F. B. Tett, G. S. Jones, M. R. Allen, J. F. B. Mitchell, and G. J. Jenkins, 2000: External control of 20th century temperature by natural and anthropogenic forcings. *Science*, **290**, 2133–2137.

Tebaldi, C., and R. Knutti, 2007: The use of the multimodel ensemble in probabilistic climate projections. *Philos. Trans. Roy. Soc.*, **365A**, 2053–2075.

Tebaldi, C., R. L. Smith, D. Nychka, and L. O. Mearns, 2005: Quantifying uncertainty in projections of regional climate change: A Bayesian approach to the analysis of multimodel ensembles. *J. Climate*, **18**, 1524–1540.

Uppala, S. M., and Coauthors, 2005: The ERA-40 Re-Analysis. *Quart. J. Roy. Meteor. Soc.*, **131**, 2961–3012.

Van der Linden, P., and J. F. B. Mitchell, Eds., 2009: ENSEMBLES: Climate change and its impacts at seasonal, decadal and centennial timescales. Summary of research and results from the ENSEMBLES project. Met Office Hadley Centre, 160 pp. [Available from Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PB, United Kingdom.]

Walsh, J. E., W. L. Chapman, V. Romanovsky, J. H. Christensen, and M. Stendel, 2008: Global climate model performance over Alaska and Greenland. *J. Climate*, **21**, 6156–6174.

Webster, M. D., 2003: Communicating climate change uncertainty to policy-makers and the public. *Climatic Change*, **61**, 1–8.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: Generalization of the discrete Brier and ranked probability skill scores for weighted multimodel ensemble forecasts. *Mon. Wea. Rev.*, **135**, 2778–2785.

Weigel, A. P., D. Baggenstos, M. A. Liniger, F. Vitart, and C. Appenzeller, 2008a: Probabilistic verification of monthly temperature forecasts. *Mon. Wea. Rev.*, **136**, 5162–5182.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2008b: Can multimodel combination really enhance the prediction skill of ensemble forecasts? *Quart. J. Roy. Meteor. Soc.*, **134**, 241–260.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2009: Seasonal ensemble forecasts: Are recalibrated single models better than multimodels? *Mon. Wea. Rev.*, **137**, 1460–1479.

Whetton, P., I. Macadam, J. Bathols, and J. O’Grady, 2007: Assessment of the use of current climate patterns to evaluate regional enhanced greenhouse response patterns of climate models. *Geophys. Res. Lett.*, **34**, L14701, doi:10.1029/2007GL030025.

Table 1. Average global prediction skill of seasonal forecasts (June–August) of 2-m temperature with a lead time of 1 month, obtained from the Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER) database (Palmer et al. 2004) and verified against 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analysis (ERA-40) data (Uppala et al. 2005) for the period 1960–2001. Skill is measured by the positively oriented ranked probability skill score (RPSS; Epstein 1969). The verification context is described in detail in Weigel et al. (2008b). Shown is the RPSS for ECMWF’s “System 2” (*M*1), for the Met Office’s “GloSea” (*M*2), and for multimodels (*MM*) constructed from *M*1 and *M*2 with (i) equal weights; (ii) optimum weights obtained gridpoint by gridpoint from 40, 20, and 10 yr of hindcast data by optimizing the ignorance score of Roulston and Smith (2002); and (iii) random weights. Skill values are given in percent.

Table 2. Selection of relative noise ratio values (*R*) as estimated from the literature. Note that different methodologies have been applied in the studies cited.