The distribution of model-based estimates of equilibrium climate sensitivity has not changed substantially in more than 30 years. Efforts to narrow this distribution by weighting projections according to measures of model fidelity have so far failed, largely because climate sensitivity is independent of current measures of skill in current ensembles of models. This work presents a cautionary example showing that measures of model fidelity that are effective at narrowing the distribution of future projections (because they are systematically related to climate sensitivity in an ensemble of models) may be poor measures of the likelihood that a model will provide an accurate estimate of climate sensitivity (and thus degrade distributions of projections if they are used as weights). Furthermore, it appears unlikely that statistical tests alone can identify robust measures of likelihood. The conclusions are drawn from two ensembles: one obtained by perturbing parameters in a single climate model and a second containing the majority of the world’s climate models. The simple ensemble reproduces many aspects of the multimodel ensemble, including the distributions of skill in reproducing the present-day climatology of clouds and radiation, the distribution of climate sensitivity, and the dependence of climate sensitivity on certain cloud regimes. Weighting by error measures targeted on those regimes permits the development of tighter relationships between climate sensitivity and model error and, hence, narrower distributions of climate sensitivity in the simple ensemble. These relationships, however, do not carry into the multimodel ensemble. This suggests that model weighting based on statistical relationships alone is unfounded and perhaps that climate model errors are still large enough that model weighting is not sensible.
1. Model error and climate sensitivity
Equilibrium climate sensitivity, defined as the response in global-mean near-surface temperature to a doubling of atmospheric CO2 concentrations from preindustrial levels, is a useful proxy for climate change because many other projections scale with it. Climate models produce a range of estimates of climate sensitivity that can themselves be sensitive to fairly small changes in model formulation (Soden et al. 2004). The distribution of these projections has remained roughly the same for more than 30 years (cf. Charney et al. 1979; Meehl et al. 2007b).
One might expect that, as climate models improve over time, projections would converge to a narrower distribution, but this has not yet proved true: successive generations of climate models have produced improved simulations of the present-day climate (Reichler and Kim 2008) but essentially unchanged distributions of climate sensitivity (Knutti et al. 2008).
The distribution might also be narrowed by invoking Bayes’ theorem and weighting each prediction of climate sensitivity by the likelihood of the corresponding model (Murphy et al. 2004; Stainforth et al. 2005; Knutti et al. 2010). This likelihood is usually modeled as a decreasing function of model error, defined as some measure of the difference between long-term averages of observations and model simulations of the present-day climate. Weighting ensembles is fraught with theoretical issues, including the impact of the sampling strategy used to construct the initial ensemble (Frame et al. 2005) and questions of how to treat an ensemble in which members have varying degrees of interdependence (e.g., Knutti et al. 2010; Tebaldi and Knutti 2007). But weighting projections has so far failed to substantially narrow distributions of climate sensitivity for a more practical reason: in current ensembles of climate models, global measures of error are not systematically related to climate sensitivity or the underlying feedbacks (Knutti et al. 2006; Murphy et al. 2004; Piani et al. 2005; Sanderson et al. 2008; Collins et al. 2011).
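Schematically, this weighting replaces the equally weighted ensemble distribution of sensitivity with a weighted one (the notation below is ours, introduced only to fix ideas):

```latex
w_i \;\propto\; L(E_i), \qquad \sum_{i=1}^{N} w_i = 1, \qquad
p(S \mid \mathrm{obs}) \;\approx\; \sum_{i=1}^{N} w_i \,\delta(S - S_i),
```

where $S_i$ is the climate sensitivity of model $i$, $E_i$ its present-day error, and $L$ a decreasing function of error; equal weights $w_i = 1/N$ recover the unweighted ensemble distribution.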
Any observable measure of present-day error that is correlated with climate sensitivity in a given ensemble of climate projections, if used as a weight, would narrow the distribution of climate sensitivity estimates. This makes it tempting to seek such measures. But, if the systematic relationships between the present day and the future in an ensemble of models have causes that are not shared by the physical climate system, weighting by such a measure can introduce substantial projection errors (Weigel et al. 2010).
Here we provide a practical demonstration of how hard it can be to determine whether relationships between the present day and the future in a given ensemble have a more general basis. We consider two ensembles of climate models: one containing a wide range of models and another employing a single model with varied values of closure parameters. We use the simpler single-model ensemble as a proxy for understanding the behavior of the more complicated multimodel ensemble, much as one might use the more complicated ensemble to understand the real world. Section 2 describes the construction of the simple ensemble; we then show that this simple ensemble reproduces several relevant aspects of the multimodel ensemble. Section 4 describes the construction of a metric of present-day performance that is correlated with climate sensitivity in the simple model but does not generalize to the multimodel ensemble. We conclude by exploring the implications for model weighting.
2. A simple ensemble spanning a range of errors and climate sensitivities
We construct a perturbed-parameter ensemble by varying the values of selected closure parameters (Table 1) in physical parameterizations of the general circulation model ECHAM5 (Roeckner et al. 2003). The parameters chosen are poorly constrained by observations and are those used to adjust the model so that its energy budget is balanced at the top of the atmosphere (to within observational uncertainties and accounting for ocean heat storage). Each parameter is restricted to a fairly small range around its default value, and all parameters are sampled simultaneously using Latin hypercube sampling (McKay et al. 1979). Five hundred realizations of ECHAM5 are created, and each is run for a single year using present-day climatological distributions of sea ice and sea surface temperature.
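The sampling step can be sketched as follows. This is a minimal Latin hypercube implementation; the parameter names and ranges are placeholders of our own, since the values in Table 1 are not reproduced here:

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=None):
    """Latin hypercube sample: each dimension is split into n_samples
    equal-probability strata, one point is drawn per stratum, and the
    strata are permuted independently in each dimension."""
    rng = np.random.default_rng(seed)
    n_dim = len(bounds)
    # one stratified uniform draw in [0, 1) per (sample, dimension)
    u = (rng.random((n_samples, n_dim)) + np.arange(n_samples)[:, None]) / n_samples
    for j in range(n_dim):                  # decouple strata across dimensions
        u[:, j] = rng.permutation(u[:, j])
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)               # scale to parameter bounds

# hypothetical parameter bounds (illustrative only; not the Table 1 values)
bounds = [(0.1, 0.3),    # e.g. entrainment rate, shallow convection
          (1e-4, 3e-4),  # e.g. entrainment rate, deep convection
          (0.2, 0.5)]    # e.g. cloud mass flux parameter
samples = latin_hypercube(500, bounds, seed=0)  # one row per model variant
```

Each row would then supply the closure-parameter settings for one of the 500 ECHAM5 realizations.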
For each ensemble member we compute an aggregate measure of the error in simulating the present-day distribution of clouds, radiation, and precipitation. Because it is not known which observable aspects, if any, of the present-day climate are connected to climate sensitivity, any aggregate metric is arbitrary; we justify the narrow focus of our choice by noting that (i) differences in cloud feedbacks drive much of the diversity in climate sensitivity estimates from climate models (Soden and Held 2006), particularly by affecting the radiation budget, and (ii) a majority of the varied parameters are cloud related. We compute the root-mean-square error relative to observations for cloud fraction, longwave and shortwave cloud radiative effects at the top of the atmosphere (e.g., Hartmann and Short 1980), and surface precipitation over each month of the annual cycle using the observations and methodology described by Pincus et al. (2008). These errors are much larger in our short integrations than for long runs with well-tuned models because sampling errors are large. Still, the difference in errors based on individual years from longer runs (described below) is very small relative to the difference in error spanned by the ensemble, indicating that the diversity in error is robust. Errors in individual fields are standardized so that the distribution of each error across the ensemble has zero mean and a standard deviation of one, then added together to provide an aggregate error measure for each model, where low errors reflect greater skill relative to other members of the ensemble.
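In code, the standardize-and-sum step might look like the sketch below; the array shapes and the number of fields are assumptions on our part:

```python
import numpy as np

def aggregate_error(rmse):
    """Combine per-field RMS errors into one aggregate score per model.

    rmse: array of shape (n_models, n_fields), holding the RMS error
    versus observations for each field (e.g. cloud fraction, longwave
    and shortwave cloud radiative effect, surface precipitation).
    Each column is standardized to zero mean and unit standard
    deviation across the ensemble, then the columns are summed, so
    low values indicate greater skill relative to other members.
    """
    z = (rmse - rmse.mean(axis=0)) / rmse.std(axis=0)
    return z.sum(axis=1)

# illustrative use with random stand-in errors for 500 models, 4 fields
rng = np.random.default_rng(1)
scores = aggregate_error(rng.random((500, 4)))
```

Standardizing first keeps a field with large absolute errors (e.g. precipitation in mm per day) from dominating fields measured in other units.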
We sort the models according to this measure of aggregate error and compute the equilibrium climate sensitivity of every tenth model across the range of aggregate skill (so that the distribution of skill in the initial ensemble is roughly preserved). Ten-year runs are performed using a slab ocean model and present-day greenhouse gas concentrations, from which we determine the flux corrections necessary to maintain present-day sea surface temperatures. A 50-yr simulation is then performed using the same ocean heat flux corrections but with doubled carbon dioxide concentrations. Equilibrium climate sensitivity is computed as the difference in global mean surface temperature between the last ten years of the doubled CO2 and the present-day simulations.
3. The simple ensemble as proxy for the multimodel ensemble
Results from this ensemble, in which all diversity arises from parametric uncertainty, are comparable in many ways to the multimodel ensemble from the World Climate Research Programme’s Coupled Model Intercomparison Project phase 3 (CMIP3, see Meehl et al. 2007a), which represents the majority of the world’s climate models and contains both parametric and structural variability. In particular, the distributions of climate sensitivity (Fig. 1a) and our aggregated measure of global cloud-related model error (Fig. 1b) are similar in both ensembles. These quantities are not systematically related to each other in either ensemble (Fig. 2). The similarity in the distributions of error and sensitivity, as well as the lack of a connection between the two, mirror previous experiences across a wide range of perturbed-parameter ensembles (Murphy et al. 2004; Stainforth et al. 2005; Collins et al. 2011).
The two ensembles also share an important structural feature: the same mechanism underlies the variability in climate sensitivity. In both ensembles models with a large change in the net cloud radiative effect under doubled CO2 concentrations are those with higher climate sensitivity (Fig. 1a). The longwave cloud radiative effect in our ensemble does not change much between present-day and doubled CO2 conditions, which is also consistent with robust (positive) longwave cloud feedbacks across the CMIP3 simulations (Zelinka and Hartmann 2010). The diversity in shortwave cloud radiative effect (CRESW) changes, in turn, is largely driven by diversity in the response of low-latitude oceanic boundary layer clouds (Bony and Dufresne 2005).
By these measures, the perturbed-parameter ensemble is a successful proxy for the multimodel ensemble. This allows us to test the generality of model weighting techniques in two structurally distinct but statistically similar ensembles.
4. Developing measures of model error linked to climate sensitivity
We now design a measure of error in reproducing the present-day climate that is explicitly related to climate sensitivity in our simple ensemble. We identify such a measure by focusing on the low-latitude oceanic boundary layer clouds whose response is tightly linked to climate sensitivity (Bony and Dufresne 2005). Boundary layer clouds dominate CRESW in subsidence regions, that is, where the midtropospheric pressure velocity is downward (ω500 > 0), so we sort present-day CRESW by this quantity (Bony et al. 2004). In our ensemble the present-day distribution of CRESW in subsidence regions differs markedly between the 10 highest- and 10 lowest-sensitivity model variants (Fig. 3a). Higher-sensitivity models have weaker values of CRESW, indicating that clouds are some combination of less frequent, less extensive, or less reflective than in low-sensitivity simulations. The higher-sensitivity models are also more consistent with observations (here, cloud radiative effect derived from satellite measurements (Wielicki et al. 1996; Loeb et al. 2009), sorted by ω500 inferred from European Centre for Medium-Range Weather Forecasts Interim reanalysis (ERA-Interim) data (Simmons et al. 2007)). Although the highest- and lowest-sensitivity models in our ensemble are distinct from each other, at the most frequent values of subsidence essentially all members overestimate CRESW relative to observations. In regions of large-scale ascent (ω500 < 0), the distributions of CRESW in the highest- and lowest-sensitivity models are much broader and overlap significantly.
In nature, boundary layer clouds in subsiding regions over the oceans are further correlated (Medeiros and Stevens 2011) with lower-tropospheric thermodynamic stability (LTS) (see Bretherton and Wyant 1997; Klein and Hartmann 1993), here defined as the difference in the potential temperature at 1000 and 700 hPa. Our simple ensemble reproduces this dependency as well (Fig. 3b). Through much of the range of LTS, the highest- and lowest-sensitivity models are indistinguishable, but in the range 13 < LTS < 17 K CRESW in the high-sensitivity models is consistently weaker, and in better agreement with observations, than for low-sensitivity models. These are the most frequent values of LTS in subsiding regions in our ensemble.
Figure 3 demonstrates why global measures of skill are unrelated to model climate sensitivity: the clouds whose systematic changes explain the diversity in sensitivity occur in only a small region of the globe. Most measures of skill compare models to observations over global domains (e.g., Gleckler et al. 2008; Pincus et al. 2008; Reichler and Kim 2008). Restricting the geographical domain over which errors are computed would not change this result much: even considering only the low-latitude oceans, the root-mean-square difference with observations is influenced not only by the regions controlling the sensitivity but also by ascending regions, where errors are large and low-sensitivity models perform somewhat better, on average.
We define instead a conditioned error measure Ec as the root-mean-square difference between model simulations and observations of CRESW integrated over regions with large-scale subsidence (ω500 > 0 Pa s−1) and moderate lower-tropospheric stability (13 < LTS < 17 K). Regions satisfying both conditions comprise just 5% of the area of the tropics (2.5% of the globe) in the observations and somewhat more in the models. Nonetheless, Ec is a reasonably good predictor of climate sensitivity in the simple ensemble (Fig. 4), which means it can be used to narrow the distribution of climate sensitivity estimates. Figure 4b shows the distribution of climate sensitivity obtained from the perturbed-parameter ensemble before and after weighting by a likelihood L(Ec) that decreases with Ec (following Murphy et al. 2004). The standard deviation of the posterior distribution is three-quarters that of the prior distribution, mostly because a few models with low sensitivity have large errors and hence low weight. The mean climate sensitivity also increases by 0.35 K.
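A sketch of Ec and the subsequent likelihood weighting follows. The Gaussian form of the likelihood and its scale are our assumptions for illustration (Murphy et al. 2004 describe the original approach), and the inputs are taken to be flattened, area-matched grid-point arrays:

```python
import numpy as np

def conditioned_error(model_cre, obs_cre, omega500, lts):
    """RMS difference in shortwave cloud radiative effect, restricted
    to subsiding regions (omega500 > 0 Pa/s) of moderate stability
    (13 K < LTS < 17 K)."""
    mask = (omega500 > 0.0) & (lts > 13.0) & (lts < 17.0)
    diff = model_cre[mask] - obs_cre[mask]
    return np.sqrt(np.mean(diff ** 2))

def likelihood_weights(errors, scale):
    """Normalized weights proportional to exp(-E^2 / (2 scale^2));
    the Gaussian form and scale are illustrative assumptions."""
    w = np.exp(-0.5 * (np.asarray(errors) / scale) ** 2)
    return w / w.sum()

def weighted_stats(sensitivity, weights):
    """Mean and standard deviation of the weighted (posterior) distribution."""
    mean = np.sum(weights * sensitivity)
    std = np.sqrt(np.sum(weights * (sensitivity - mean) ** 2))
    return mean, std
```

Comparing `weighted_stats` under equal weights (the prior) and under `likelihood_weights(ec_values, scale)` (the posterior) reproduces the kind of narrowing reported above for the simple ensemble.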
But, despite the many similarities between the perturbed-parameter and multimodel ensembles, the systematic relationship between climate sensitivity and Ec does not carry into the multimodel ensemble (Fig. 5), nor does the distribution of sensitivity estimates from the multimodel ensemble change when weighted by L(Ec).
5. Implications for weighting projections from multimodel ensembles
One could conclude that we have obtained a null result and that the single-model perturbed-parameter ensemble is, after all, a poor proxy for the multimodel ensemble. Instead, we propose that these calculations are a concrete illustration of some of the issues involved in the weighting and more general interpretation of multimodel ensembles.
First, our results confirm that it is possible to obtain distributions of climate sensitivity and global measures of error as diverse as those produced by the multimodel ensemble with even modest variations about a single model. This suggests that variability in error and sensitivity at these levels is easy to come by (though why this is so remains an intriguing open question). In fact, in our ensemble diversity in skill and climate sensitivity arises from surprisingly simple parametric sensitivity: Climate sensitivity is primarily related to the entrainment rate for shallow convection, which varies along with a cloud mass flux parameter (explaining 44% of the variance in climate sensitivity, Table 1) while aggregate error is related to another parameter, the entrainment rate for deep convection (explaining 64% of the variance in aggregated error; Table 1). If broad diversity in behavior can arise from underlying simplicity, then the diversity itself is uninformative. This is an illustrative reminder that the distribution of climate sensitivity from any model ensemble cannot be interpreted as an estimate of the total uncertainty in climate sensitivity.
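The "variance explained" figures quoted above correspond to the squared correlation between a single parameter and the response across the ensemble. A sketch, with synthetic data standing in for the actual ensemble (the coefficients below are arbitrary):

```python
import numpy as np

def variance_explained(x, y):
    """Fraction of the variance in y explained by a linear fit on x
    (the squared Pearson correlation, R^2)."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# synthetic stand-in: sensitivity partly controlled by one parameter
# plus noise (values are illustrative only, not from the ensemble)
rng = np.random.default_rng(0)
param = rng.random(50)
sens = 2.5 + 1.5 * param + 0.3 * rng.standard_normal(50)
r2 = variance_explained(param, sens)
```

Applied to the actual ensemble, this kind of single-parameter regression is what attributes 44% of the variance in climate sensitivity to the shallow-convective entrainment rate and 64% of the variance in aggregate error to the deep-convective entrainment rate.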
Second, while the motivation to narrow the distribution of climate sensitivity estimates is strong, our results dramatize the danger of focusing exclusively on this goal. Relationships between sensitivity and model fidelity in any ensemble emerge from an unknown mix of underlying similarity in model representation and error, statistical sampling error, and physical relationships also present in the natural world. This means that arbitrarily chosen error measures may arise from underlying similarity not present in the physical climate system. We argue that, because metrics developed from the full multimodel ensemble alone cannot be falsified by comparison to more general ensembles, they cannot be justified as a model likelihood purely on the basis of the strength of the statistical connection between that metric and climate sensitivity. Indeed, where observations have been used successfully to constrain model response (Hall and Qu 2006; Clement et al. 2009) statistical metrics have been bolstered by physical arguments. Much depends on the way weights are chosen since incorrect weighting (i.e., weighting not related to true model likelihood) can substantially reduce the benefits of using an ensemble of projections (Weigel et al. 2010).
Finally, it is possible that present-day models are not yet sufficiently accurate to benefit from model weighting. Weighting model projections by skill is an assertion that models are likely to produce accurate estimates of future climate in proportion to their ability to reproduce some aspects of the present-day climate; the implicit assumption is that models with higher skill are more likely to be accurate representations of the physical climate system. But, by most measures no current climate model produces distributions of the present-day climate statistically consistent with observations (Gleckler et al. 2008; Pincus et al. 2008, see also Figs. 3 and 5), implying that all models are formally unlikely. Weighting an ensemble under these circumstances is essentially asserting that incorrect models are more reliable than even more incorrect models. But the result of Bayes’ theorem is ambiguous when the system being modeled is far from the system being observed, so it may be that model weighting will be more profitable when the collection of the models that we have is closer to the world we observe.
We thank the Max Planck Society, the International Max Planck Research School for Earth System Modelling, the National Science Foundation’s Center for Multi-Scale Modeling of Atmospheric Processes, and the German Research Foundation Emmy Noether grant program for supporting this work. Bjorn Stevens, Louise Nuijens, Thorsten Mauritsen, and Jeffrey L. Anderson provided valuable feedback on early drafts of this paper, and three anonymous reviewers helped us refine the arguments. We acknowledge the modeling groups, PCMDI and the WCRP’s Working Group on Coupled Modeling (WGCM) for their roles in making the WCRP CMIP3 multimodel dataset available. The Office of Science, U.S. Department of Energy, provides support for the CMIP3 dataset. Model simulations were carried out on the supercomputing facilities of the German Climate Computation Center (DKRZ) in Hamburg.
Current affiliation: European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom.
Current affiliation: Institute for Meteorology, Universität Leipzig, Leipzig, Germany.