We discuss the current code of practice in the climate sciences to routinely create climate model ensembles as ensembles of opportunity from the newest phase of the Coupled Model Intercomparison Project (CMIP). We give a two-step argument to rethink this process. First, the differences between generations of ensembles corresponding to different CMIP phases in key climate quantities are not large enough to warrant an automatic separation into generational ensembles for CMIP3 and CMIP5. Second, we suggest that climate model ensembles cannot continue to be mere ensembles of opportunity but should always be based on a transparent scientific decision process. If ensembles can be constrained by observation, then they should be constructed as target ensembles that are specifically tailored to a physical question. If model ensembles cannot be constrained by observation, then they should be constructed as cross-generational ensembles, including all available model data to enhance structural model diversity and to better sample the underlying uncertainties. To facilitate this, CMIP should guide the necessarily ongoing process of updating experimental protocols for the evaluation and documentation of coupled models. With an emphasis on easy access to model data and facilitating the filtering of climate model data across all CMIP generations and experiments, our community could return to the underlying idea of using model data ensembles to improve uncertainty quantification, evaluation, and cross-institutional exchange.
In constructing climate model data ensembles, an argument is made for defaulting to cross-generational ensembles that use all available model data except when constrained by observations.
In climate science, the use of an ensemble of simulations from multiple models performing a common experiment has become a traditional means to represent, estimate, and average model uncertainties and errors. Climate model development and system understanding have been advanced by the series of Coupled Model Intercomparison Projects (CMIPs), initiated by the Working Group on Coupled Modeling of the World Climate Research Program (WCRP). The third-generation CMIP3 was tailored to answer scientific questions relevant to the scientific assessment for the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC; Solomon et al. 2007). CMIP5 is the newest, most comprehensive set of experiments (Taylor et al. 2012), helping in the assessment of current model capabilities and yielding results that are used in the IPCC’s Fifth Assessment Report (AR5; IPCC 2013). Throughout the history of the CMIP phases, it has been common practice to assume that the models in different phases are sufficiently different to separate existing model simulations into generational ensembles of opportunity. The climate research community is now in the process of discussing the next phase, CMIP6. In this article we argue for a probably well-known but inconvenient fact: the ad hoc separation into generational ensembles is mostly due to historical and implementation convenience and has no physical meaning for the representation of key climate variables such as global-mean surface temperature and precipitation characteristics for CMIP3 and CMIP5. Our community must do better in constructing suitable and robust model ensembles than using arbitrarily sized ensembles of opportunity.
CMIP generational ensembles are currently, by construction, such ensembles of opportunity. Some institutions synchronize their model development cycle to the CMIP cycle and release newer versions of a given model for the next generational ensemble, for example, the Max Planck Institute for Meteorology’s ECHAM5 for CMIP3 and ECHAM6 for CMIP5 (Stevens et al. 2013), formally indicating generational improvement of the ensembles. At the same time, though, new institutes join the intercomparison projects with new models, which can be seen from the 59 models for the historical runs in CMIP5 [the number of models that can be downloaded from the Program for Climate Model Diagnosis and Intercomparison (PCMDI) Earth System Grid (ESG) node as of 19 June 2014] as compared to 22 models for CMIP3 (Randall et al. 2007), while some other institutions have used the same model version for multiple phases of CMIP. There are no minimum quality requirements to join the ensemble; the model has only to be documented and referenced in the peer-reviewed literature, to run, and to adhere to the experimental protocol data requirements (Taylor et al. 2012). Despite ongoing research (Knutti 2010; Knutti et al. 2010a,b), there is no consensus on how quality or skill measures could be used to reduce the ensemble size. Some work does suggest that dismissing the poorer performing models can improve overall ensemble quality (Matsueda and Palmer 2011), but an appropriate set of criteria is almost certainly application dependent. The most recent generational ensemble is typically the de facto choice for research, a consensus based mostly on practical reasons. The two most relevant practical reasons are that, because of changes in experimental protocols, only members of the same generational ensemble share comparable (though not identical) forcings, and that there are different computational archives for each generational ensemble.
We believe that this shared practice of constructing generational ensembles is equivalent to the acceptance of an implied hypothesis: it is assumed that newer-generation models are automatically and systematically better than their predecessors, even though the CMIP process does not structurally necessitate this. In this article, we present an argument for a different default ensemble construction: use all CMIP models that exist for a given experiment, across generational and—potentially—forcing differences, and then work hard to deliberately constrain these ensembles.
The CMIP process has been successful in harmonizing global climate modeling efforts and facilitating intercomparison that has substantially improved understanding. But it has also some inherent deficiencies: the time frame of the CMIP phases is not synchronous with modeling development cycles, leading to problems with specific CMIP model versions not being ready yet or being rushed. Additionally, CMIP does not incorporate sufficiently standardized information on questions of calibration and evaluation. It has also been suggested by the reviewers of this essay that the amount of harmonization between the modeling centers could even be seen as too successful to warrant a sufficient number of independent models. In this article, we present some arguments that support the process that led to the CMIP modeling community considering a shift in the way the project is implemented (Meehl et al. 2014). In support of this initiated transition, we describe a rethinking of the Coupled Model Intercomparison Project that incorporates the long-term need for documentation, calibration, and evaluation, as well as the necessity of flexibility for scientific experiments and the ongoing incorporation of new scientific insights. Our suggestions are based on the way that we believe climate model data ensembles should be constructed, with our structural recommendations following naturally.
This essay reflects the authors’ experiences from handling the CMIP ensembles for AR5 and is meant to stimulate discussion. The discussions that led to this essay are not part of the official CMIP discussion within WCRP. In the following, we use some figures of model data to exemplify the quality of differences between CMIP3 and CMIP5. The main point of this essay is to use these simple plots to support our reformulated default ensemble construction and the corresponding implications for the CMIP process, not to discuss the exact size or cause for these differences. We hope that our opinion as outlined below enriches the ongoing discussion on how to create useful multimodel ensembles and is evaluated independently from the organizational suggestions. After all, CMIP is just an organizational framework of our community to make our underlying scientific work better and easier.
CMIP3 AND CMIP5 ARE QUALITATIVELY TOO SIMILAR IN MEAN STATE AND RESPONSE TO WARRANT AN AUTOMATIC GENERATIONAL SEPARATION.
To show this, we use model data from the CMIP5 historical integrations and the equilibrium climate sensitivities of CMIP5 (Andrews et al. 2012; Taylor et al. 2012) and CMIP3 (Solomon et al. 2007) models. The most basic, if very challenging, task of any climate model is to represent the current climate’s surface air temperature and precipitation climatology, as both surface air temperature and precipitation dominate the impacts on ecology and society (IPCC 2007). A very basic quality measure of near-equilibrium capabilities of current climate models is therefore the mean climatology bias with respect to a chosen set of observations. We use the 1980–2005 climatological reference period adopted for the IPCC AR5 Working Group 1 (WGI) model evaluation. We additionally create a fictitious cross-generational ensemble, which we call CMIP8 and which consists of all corresponding model data of CMIP5 and CMIP3. All model data are interpolated onto a regular 1° × 1° grid. Figures 1a–c show the multimodel mean biases in surface air temperature for CMIP3, CMIP5, and CMIP8, respectively. We see that the bias pattern appears to be unchanged. How close the maps could be to zero remains an open question (Annan and Hargreaves 2010). For the 25-yr period, the role of natural variability is largely damped, and the biases shown mostly represent shared structural errors across both generational ensembles. We conclude one major fact from the similarity of the three panels: CMIP5 is not qualitatively better in its ability to represent twentieth-century mean-state climatologies than CMIP3, in the sense that the location and structure of the bias are not fundamentally different, even though the absolute size of the bias is incrementally improved (e.g., Knutti et al. 2013). Figures 1g–i show a similar result for precipitation: CMIP8 and CMIP5 do not seem to be qualitatively better than CMIP3, CMIP5 is better in some regions, and CMIP8 is in between CMIP3 and CMIP5.
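As an illustration only, the multimodel mean bias for a generational ensemble and for a pooled, cross-generational ensemble might be computed along the following lines. This is a minimal sketch, assuming that all fields have already been interpolated onto a common grid and flattened into lists; the model names, the three-point grid, and all numerical values are hypothetical placeholders, not data from CMIP3 or CMIP5.

```python
from statistics import fmean


def multimodel_mean_bias(model_fields, observed):
    """Return the grid-point bias of the multimodel mean relative to observations."""
    bias = []
    for i in range(len(observed)):
        ensemble_mean = fmean([field[i] for field in model_fields.values()])
        bias.append(ensemble_mean - observed[i])
    return bias


# Hypothetical surface air temperature climatologies (K) on a tiny common grid.
observed = [288.0, 275.0, 300.0]
cmip3 = {"model_a": [287.0, 276.0, 299.0], "model_b": [289.0, 278.0, 299.0]}
cmip5 = {"model_c": [288.0, 276.0, 299.5], "model_d": [288.0, 277.0, 299.5]}

# The cross-generational ensemble simply pools every available member.
pooled = {**cmip3, **cmip5}

bias_cmip3 = multimodel_mean_bias(cmip3, observed)
bias_cmip5 = multimodel_mean_bias(cmip5, observed)
bias_pooled = multimodel_mean_bias(pooled, observed)
```

With equally sized toy ensembles, the pooled bias at each grid point lies between the two generational biases, which mirrors the qualitative behavior described for CMIP8 above.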
To put these results into context, in Fig. 2 we show that the remaining bias in CMIP5 is large compared to either the progress between CMIP3 and CMIP5 or to the temperature reanalysis inconsistency as a measure of observational uncertainty (as the average difference between three reanalyses).
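The three magnitudes compared here can be sketched as follows. The source does not specify the exact metric, so this sketch assumes a mean absolute difference over grid points and an average over all pairs of reanalyses; the function names and every number are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def mean_abs(values):
    """Mean absolute value over all grid points (assumed metric)."""
    return fmean([abs(v) for v in values])


def reanalysis_inconsistency(reanalyses):
    """Average, over all pairs of reanalyses, of their mean absolute difference."""
    pair_diffs = []
    for a, b in combinations(reanalyses, 2):
        pair_diffs.append(mean_abs([x - y for x, y in zip(a, b)]))
    return fmean(pair_diffs)


# Hypothetical multimodel mean biases (K) on a three-point grid.
bias_cmip3 = [2.0, -2.5, 3.0]
bias_cmip5 = [1.5, -2.0, 2.5]

# Three hypothetical reanalysis climatologies (K) on the same grid.
reanalyses = [
    [288.0, 275.0, 300.0],
    [288.5, 275.5, 300.5],
    [287.5, 274.5, 299.5],
]

remaining_bias = mean_abs(bias_cmip5)
progress = mean_abs(bias_cmip3) - mean_abs(bias_cmip5)
obs_uncertainty = reanalysis_inconsistency(reanalyses)
```

In this toy setup the remaining bias exceeds both the progress between generations and the reanalysis spread, which is the ordering the paragraph above describes for Fig. 2.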
Pattern correlations between models and observations are another way of quantifying the capability of models to represent the climate system’s mean state. The mean pattern correlation of CMIP5 is again better for all quantities (Fig. 3) when compared with CMIP3, but, as the same figure shows, the best models of CMIP3 are better than the worst models of CMIP5. The overall ensemble is mostly better because it has lost the worst members of the CMIP3 ensemble. From Fig. 3, it seems more natural to create (arbitrarily sized) top 20 model ensembles from CMIP8 per quantity (called CMIP20 here) than to use only CMIP5. The exact construction of such constrained ensembles is difficult, but here our intent is simply to demonstrate the overlap in the quality of the generational ensembles. To create an effectively constrained ensemble, we argue that one should start from the full cross-generational ensemble with all available model data. A cross-generational CMIP8 ensemble multimodel mean lies between CMIP3 and CMIP5, as is expected from overlapping ensembles, but has larger spread, which might be a benefit for uncertainty analysis in the sense that it covers more of the intrinsic uncertainty as long as we cannot dismiss specific models.
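A top-N selection from a pooled ensemble, ranked by pattern correlation, could look roughly like the sketch below. It assumes a centered, weighted pattern correlation (on a real latitude-longitude grid the weights would typically be proportional to the cosine of latitude); the weights, model names, and fields are hypothetical, and N is as arbitrary here as the choice of 20 in the text.

```python
import math


def pattern_correlation(model, observed, weights):
    """Weighted, centered pattern correlation between two flattened fields."""
    total = sum(weights)
    m_mean = sum(w * m for w, m in zip(weights, model)) / total
    o_mean = sum(w * o for w, o in zip(weights, observed)) / total
    cov = sum(
        w * (m - m_mean) * (o - o_mean)
        for w, m, o in zip(weights, model, observed)
    )
    m_var = sum(w * (m - m_mean) ** 2 for w, m in zip(weights, model))
    o_var = sum(w * (o - o_mean) ** 2 for w, o in zip(weights, observed))
    return cov / math.sqrt(m_var * o_var)


def top_n(ensemble, observed, weights, n):
    """Names of the n members with the highest pattern correlation."""
    scores = {
        name: pattern_correlation(field, observed, weights)
        for name, field in ensemble.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical fields on a four-point grid with equal (placeholder) weights.
observed = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0]
pooled = {
    "cmip3_good": [1.1, 2.1, 3.1, 4.1],
    "cmip3_poor": [4.0, 3.0, 2.0, 1.0],
    "cmip5_mid": [1.0, 3.0, 2.0, 4.0],
    "cmip5_good": [1.0, 2.0, 3.5, 4.0],
}

selected = top_n(pooled, observed, weights, 2)
```

In this toy case the selected pair contains one member from each generation, illustrating how a ranking over the pooled ensemble can retain older models that outperform newer ones.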
While the mean-state representation is a basic test of climate model performance, these models are used to project the response of the climate system to (anthropogenic) forcing changes. The most basic and yet very important measure of model response is equilibrium climate sensitivity (ECS), the mean surface warming per doubling of CO2. In Fig. 4, we show the absolute global-mean temperatures during a reference period in the twentieth century and the correlation with ECS for CMIP3, CMIP5, and CMIP8, similar to the CMIP3 results in Knutti et al. (2010b). We see a small difference in the spread of absolute mean surface temperatures between CMIP3 and CMIP5, with CMIP5 being slightly less diverse. At the same time, we observe that the spread of modeled climate sensitivity is similar. The main functional conclusion, that current global climate biases are not directly correlated with climate sensitivity, does not change between CMIP3 and CMIP5 either. For CMIP3, CMIP5, or CMIP8, if one looks at all data points, there is no simple correlation between a warmer mean global historical state and a higher response to increased greenhouse gas concentrations. Additionally, it is easily seen, again, that CMIP3 and CMIP5 are overlapping ensembles. Even if specific metrics exist that separate CMIP3 and CMIP5 in further detail, in their basic response properties and the relationship of those properties to basic representation properties both ensembles are not qualitatively different. We do not further discuss ongoing attempts to constrain climate sensitivities from physical constraints (Caldwell et al. 2014; Fasullo and Trenberth 2012; Hall and Qu 2006; Sherwood et al. 2014) because this is not part of the argument of this essay: the spread of the most fundamental response property and its relationship with the most basic mean-state property seems to be basically unchanged between CMIP3 and CMIP5.
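The kind of check described here, whether a warmer historical mean state goes along with a higher sensitivity, can be sketched with a plain Pearson correlation over the pooled ensemble. The (temperature, ECS) pairs below are hypothetical placeholders chosen for illustration; they are not the values behind Fig. 4.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    x_var = sum((a - x_mean) ** 2 for a in x)
    y_var = sum((b - y_mean) ** 2 for b in y)
    return cov / math.sqrt(x_var * y_var)


# Hypothetical (global-mean temperature in deg C, ECS in K) pairs per model.
cmip3 = [(13.0, 3.0), (14.0, 2.0), (15.0, 3.0)]
cmip5 = [(13.5, 4.0), (14.0, 3.0), (14.5, 4.0)]
pooled = cmip3 + cmip5

temperatures = [t for t, _ in pooled]
sensitivities = [s for _, s in pooled]
r_pooled = pearson(temperatures, sensitivities)
```

A coefficient near zero for the pooled set, as in this constructed example, would correspond to the absence of a simple relationship between mean-state temperature and ECS.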
These findings do not contradict results that show that CMIP3 and CMIP5 are not identical and that CMIP5 is the better generational ensemble. We acknowledge the wealth of ongoing research that shows improvements to CMIP5 in a variety of components, regions, or metrics of the climate system (Flato et al. 2013). The missing qualitative separation of the full ensemble—the fact that there is large overlap—for some quantities between CMIP3 and CMIP5 implies, however, that our ability to model the climate system has not changed drastically in the last decade, even across different assumptions and with different models across different generations of high-performance computers. Unfortunately, we have not been able to tackle long-standing structural problems in representing the coupling of wet processes and dynamics in our climate models, as is also reflected in the WCRP’s Grand Challenges for climate models (Bony and Stevens 2012). As a result, CMIP5 and CMIP3 do not differ enough in their basic mean-state and response properties; they overlap too much to warrant an automatic generational separation.
CONSTRAINED CROSS-GENERATIONAL ENSEMBLES REFLECT A NATURAL ENSEMBLE CONSTRUCTION METHOD.
To us, for ensemble construction, the natural default seems to be to use all available model data, that is, the maximum number of ensemble members for all research questions whenever there is no consensus on evaluation-constrained ensembles. This approach is based in part on scientific–philosophical reasons: if we cannot constrain or cull one model or model–forcing combination on evaluation grounds, then we should not do it. Additionally, as long as we cannot constrain the full ensemble, we cover the underlying uncertainty better if we use all available models. If the underlying models share systematic biases, as do the CMIP3 and CMIP5 models (see Fig. 1), then the increased ensemble size will most likely not be very efficient in creating an ensemble spread that covers the full underlying uncertainty, but increasing ensemble size via a cross-generational ensemble could lead to an at least slightly increased number of independent model–forcing combinations. Cross-generational ensembles will emphasize the influence of models that exist across different CMIP generations. It could be argued that it is reasonable to reward long model development experience, but if this is not a desired property, then it could be circumvented by creating the cross-generational ensemble with one member per institution, or filters could be developed according to current research on model independence (Knutti et al. 2013).
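The one-member-per-institution variant might be sketched as below. The text does not say which member to keep, so preferring the newest CMIP phase is purely an assumption of this sketch, as are the record layout, the institution labels, and the model names.

```python
def one_member_per_institution(members):
    """Keep a single model per institution (assumed rule: newest phase wins)."""
    chosen = {}
    for member in members:
        current = chosen.get(member["institution"])
        if current is None or member["phase"] > current["phase"]:
            chosen[member["institution"]] = member
    return sorted(chosen.values(), key=lambda m: m["model"])


# Hypothetical cross-generational ensemble metadata.
members = [
    {"institution": "inst_a", "model": "model_a_v1", "phase": 3},
    {"institution": "inst_a", "model": "model_a_v2", "phase": 5},
    {"institution": "inst_b", "model": "model_b_v1", "phase": 3},
]

reduced = one_member_per_institution(members)
reduced_names = [m["model"] for m in reduced]
```

Other selection rules (for example, keeping the best-performing version for a given metric) would only require changing the comparison inside the loop.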
The twentieth-century simulations used above to create one cross-generational ensemble have been performed with differing forcings in CMIP3 and CMIP5. A priori, both sets of model–forcing combinations represent an equally likely measure of how well the models can represent the current state of the climate system: the forcing assumptions for newer CMIP generations might be more realistic in the number of represented processes, but we do not have a quantitative criterion for assessing the likelihood of a given model–forcing combination. The default should therefore be to value all model–forcing combinations as equally probable and not to arbitrarily weight old model–forcing combinations with a likelihood of zero as is implicitly done in generational ensembles. The difference in forcing can lead to bigger spread for some quantities, which can be seen as an advantage when the aim is to capture the full uncertainty of our modeling efforts. To calculate ECS, a similar approach to that underlying the twentieth-century simulations can be applied: the cross-generational ensemble is constructed from two different experiments (abrupt 4 × CO2 and abrupt 2 × CO2) in CMIP5 and CMIP3, respectively, even using fundamentally different model setups with respect to the ocean model component. A way to estimate ECS comparable to the CMIP3 ECS values from the CMIP5 4 × CO2 experiment has been discussed in Andrews et al. (2012). A cross-generational ensemble of ECS estimates is after all still essentially an ensemble of estimates for a key quantity of the climate system that are comparable, even if the ensembles of the simulations are not. For projections, different forcing can make cross-generational ensemble construction more difficult than for the evaluation experiments above, although there are approaches to do so (Knutti and Sedlacek 2013).
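One way such a comparable ECS estimate is often obtained from an abrupt 4 × CO2 run is a linear regression of the top-of-atmosphere net radiative imbalance against the global-mean surface warming, with the extrapolated equilibrium warming divided by two. The sketch below assumes that approach and assumes that halving the 4 × CO2 warming is an adequate conversion to a doubling; it is not claimed to reproduce every detail of Andrews et al. (2012), and the time series are hypothetical.

```python
def ecs_from_abrupt4x(delta_t, net_flux):
    """Fit net_flux = forcing + slope * delta_t by least squares.

    Returns the extrapolated warming at zero imbalance, divided by two
    (assumed conversion from 4 x CO2 to 2 x CO2).
    """
    n = len(delta_t)
    t_mean = sum(delta_t) / n
    f_mean = sum(net_flux) / n
    slope = sum(
        (t - t_mean) * (f - f_mean) for t, f in zip(delta_t, net_flux)
    ) / sum((t - t_mean) ** 2 for t in delta_t)
    forcing = f_mean - slope * t_mean
    equilibrium_warming_4x = -forcing / slope
    return equilibrium_warming_4x / 2.0


# Hypothetical annual means: warming (K) and net imbalance (W m^-2).
delta_t = [1.0, 2.0, 3.0, 4.0, 5.0]
net_flux = [6.0, 5.0, 4.0, 3.0, 2.0]

ecs_estimate = ecs_from_abrupt4x(delta_t, net_flux)
```

Estimates obtained this way could then be placed alongside the CMIP3 values in one cross-generational list, with the caveat from the text that the underlying simulations themselves are not identical in setup.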
We recognize that fundamental forcing differences/developments for some experiments will make the idea of fluid ensemble creation difficult and in some very specific cases impossible. We argue, though, that the resulting forcing–model combination is not a priori better and should be compared when possible to the full set of the cross-generational ensemble.
A new default cross-generational ensemble construction is the natural replacement for the old generational ensemble only if we cannot constrain the ensembles based on scientific arguments. We can think of three cases where there are scientific arguments to reduce the size of the ensembles. First, the analysis of model responses to a specific forcing experiment might still require an ensemble contraction of all available models to those that have been run with this exact set of forcings; for these types of experiments, ensembles will continue to be generated as generational ensembles. Second, if only a limited number of ensemble members represent a specific physical aspect of the earth system, the use of a full cross-generational ensemble does not make sense. An example of this case is the high vertical resolution that allows for representation of the quasi-biennial oscillation (QBO). This problem also occurs in generational CMIP ensembles, in that only a handful of the new CMIP5 models can represent the QBO (Schmidt et al. 2013). Therefore, in these cases it is natural to create a subgroup of models with comparable features and to generate a target ensemble for the specific science question at hand. For such a feature, model quality is “binary”: the model either represents the process or it does not, leading to a natural way to create the target ensemble. Third, and the most difficult case, it is both desirable and possible that evaluation-constrained target ensembles will be used where either a number of best models are selected from the cross-generational ensemble or a number of worst models are discarded. We have created a simple example in this article by creating an arbitrarily sized CMIP20 ensemble for specific quantities. The details of target-ensemble constructions are determined by relevant simulation quality criteria (e.g., a specific forcing or realistic QBO) or model properties that target a specific scientific question.
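The first two cases amount to filtering the cross-generational ensemble on metadata. A minimal sketch follows, assuming each member carries a set of experiments it has run and a set of binary feature flags; the field names, the `"qbo"` flag, the experiment labels, and the model names are hypothetical and do not correspond to an existing CMIP metadata standard.

```python
def target_ensemble(members, required_features=(), experiment=None):
    """Select members that ran the experiment and represent all required features."""
    selected = []
    for member in members:
        if experiment is not None and experiment not in member["experiments"]:
            continue
        if all(f in member["features"] for f in required_features):
            selected.append(member["model"])
    return selected


# Hypothetical cross-generational metadata records.
members = [
    {"model": "model_a", "phase": 3, "experiments": {"historical"}, "features": set()},
    {"model": "model_b", "phase": 5, "experiments": {"historical", "abrupt4x"}, "features": {"qbo"}},
    {"model": "model_c", "phase": 5, "experiments": {"historical", "abrupt4x"}, "features": set()},
]

# Case 1: only members run with a specific experiment.
same_forcing = target_ensemble(members, experiment="abrupt4x")

# Case 2: only members that represent a specific process (binary criterion).
qbo_ensemble = target_ensemble(members, required_features=("qbo",))

# Default: no constraint, so every available member is used.
default_ensemble = target_ensemble(members)
```

The third, evaluation-constrained case would replace the binary test with a ranking on a quality measure, which requires the kind of application-dependent criteria discussed above.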
THE CASE FOR FLUID, CROSS-GENERATIONAL ENSEMBLE CONSTRUCTION AND CMIP6/DECK.
Some of our arguments are so basic that they have been in discussion for years, and yet the generational ensemble has remained a default even in the IPCC process. We believe that a renewed explicit discussion of the types of ensembles that we use to do our work is highly relevant and that this discussion should also be directed toward the future of CMIP. The requirements of our suggested ensemble construction for a CMIP process boil down to two points: easy access and easy filtering via controlled experimentation for better science. The rest is the responsibility of the involved researchers. Concerning easy access, there should be one central portal to all CMIP data across all generations, sorted for a given scientific question, and technical and resource problems need to be overcome, particularly given that each new CMIP generation produces much more data than the one before (so far anyway). Concerning easy filtering, this requires the development of a standardized generic and broad set of qualities of a model that can be used to filter. These qualities should incorporate structural information on models and a list of quality parameters that enable target ensembles when scientific insight allows for it. The current discussion within the WCRP community has led to the paper by Meehl et al. (2014) describing an experiment design for a sixth phase of CMIP that is fundamentally different from all previous phases. This new setup is largely consistent with our recommendations for a fluid, cross-generational ensemble construction with the establishment of an ongoing CMIP certification type of process, called Diagnosis, Evaluation, Characterization of Klima (DECK). A standardized CMIP documentation and evaluation framework could enable continuous evaluation and documentation of new model versions and could make it easier to redo model evaluation attempts with newer simulations and updated observational datasets. 
The evaluation experiments should incorporate most basic properties of a model to allow for an effective filtering in a broad variety of application cases (i.e., steady state for the atmosphere–ocean coupled system, as well as time-varying responses in twentieth-century and CO2 change experiments). If the DECK framework could be established in a sustained manner, then the current problem of differences in time scale between model development and experiment requirements could be ameliorated. Additionally, the time scale and ensemble size of generational IPCC-related scenario runs would be completely independent of the default climate ensemble construction for other science questions. A word of caution: we do not believe that CMIP can or should prescribe scientific methods to the community, but it can help in avoiding and reducing the practical impediments that have hindered our science for some time. The transition toward the establishment of the CMIP DECK experiments represents an opportunity to move beyond generational ensembles and establish a community-based capacity for application-dependent filtering. The results presented in this article support the DECK approach, and we hope that our suggestions will be useful for this planning.
The climate modeling community should not focus on constructing generational ensembles, neither from the current CMIP5 nor from a potential future CMIP6. We suggest eliminating the idea of generational ensembles because generational ensembles are not scientifically justified given the current rate of model development and progress. If quality-constrained ensembles result in generational ensembles in the future, then this might change again, but for now ensembles should be constrained by physical reasoning and uncertainty analysis or not at all. We suggest returning to the original idea of CMIP: to understand climate models, including their differences and properties, and to quantify uncertainty. We believe that, at this point, the CMIP process should lead to a continuous, iterative increase in ensemble size and quality to address scientific questions. The long-term aim of our community should be to replace the discrete steps of ad hoc generational ensembles with a fluid, scientifically more sound process of constrained target-ensemble generation for specific problems, based on results from a standardized set of documentation and evaluation experiments. As long as those constrained ensembles are not available or justified, the community should construct ensembles that incorporate all available model data. To us, the need for clear communication and explanation of why one uses a specific ensemble for a given task appears to be essential. We believe that a reformulated CMIP process could be an important organizational advancement toward achieving this more natural way of constructing multimodel climate ensembles. Even before we get to that reorganization of CMIP, we believe that we should start to spend more time and effort to construct carefully designed ensembles right now.
We acknowledge funding from the German ministry for research and education (BMBF, FKZ 01LG1005C) and the Max Planck Society. Part of this work was performed by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, with funding from the U.S. Department of Energy, Office of Science, Climate and Environmental Sciences Division, Regional and Global Climate Modeling Program. We acknowledge all modeling centers, the WCRP’s Working Group on Coupled Modelling (WGCM), and the partners of the Earth System Grid (ESG) for their roles in making available the WCRP CMIP3 and CMIP5 multimodel dataset. We also thank the three reviewers who have provided us with many suggestions and helped us a lot in refining the points that we wanted to make in this article.