Over the past decade or so, there has been much debate regarding how to evaluate climate model simulations with the goal of differentially weighting these results.
The goal emanates from the belief that not all climate models are equal; that is, they do not all provide the same quality of information, particularly about future climate. This issue is most often discussed in terms of how well the models simulate the current (and past) climates. Most recently, in a chapter of the Intergovernmental Panel on Climate Change (IPCC) Working Group I Report, the authors indicated that the climate community does not know how to weight models for determining the best possible projections of future climate. This is certainly true for both global climate models (GCMs) and regional climate models (RCMs). Given the potential importance of regional projections for developing adaptation plans, it behooves us to consider approaches to model evaluation that depart from standard quantitative methods, which have failed to clearly distinguish among different model simulations. To that end, we embarked on the study reported here.
In many contexts where scientific knowledge is evolving rapidly yet policy decisions must still be made, seeking the judgments of experts (expert elicitation) can be useful for obtaining a more comprehensive understanding of the current state of a scientific field. Here we contrast the information provided by expert elicitation with the information provided by metrics. One of the valuable outputs of expert elicitation is the diversity of expert opinion. It is particularly useful in uncovering where experts agree or disagree and why.
Our goal in conducting this pilot study was to explore the value of evaluating RCMs [using the RCM simulations of the North American Regional Climate Change Assessment Program (NARCCAP)] via an expert judgment (EJ) approach and to compare these results to more traditional approaches (e.g., calculating specific metrics).
BACKGROUND ON NARCCAP.
NARCCAP was developed to 1) explore the uncertainties in regional climate change based on a set of GCMs driving a set of RCMs; 2) provide climate change scenarios for use by the impacts and adaptation communities; and 3) provide further evaluation of RCMs. Four different GCMs were selected to drive six different RCMs for a recent period (1971–2000) and a period in the mid-twenty-first century (2041–2070). Table 1 provides brief descriptions of the six RCMs. Only one future greenhouse gas emissions scenario was used (SRES A2, a relatively high scenario). In addition, each RCM was driven by reanalyses (roughly equivalent to observations) for a 25-year period (1980–2004). The domain covers most of North America. Simulations were performed at a 50-km spatial resolution. In the current study, only the reanalysis-driven RCM results were examined, and only over the southwest region, which is dominated by the North American Monsoon (NAM) system from June through September. The NAM is a complex circulation system made up of well-defined large-scale to mesoscale seasonal circulation features. It is responsible for more than half of the annual precipitation that falls in northwestern Mexico and southern Arizona, and for significant quantities in the rest of the southwest United States as well.
Table 1. Regional climate models included in NARCCAP.
CONDUCTING THE SURVEY AND EXERCISE.
Participants.
As a first attempt at incorporating EJ approaches for model evaluation, we used a convenience sample of seven participants available for this exercise. In contrast to survey studies where large numbers of participants are needed for statistical power, EJ studies draw from small pools of concentrated expertise. Thus, EJ participant numbers typically range from 8 to 15. All participants in our study were attendees of the October 2010 NARCCAP co-PIs’ Meeting held at the National Center for Atmospheric Research in Boulder, Colorado. Participants included all co-PIs present who were engaged in performing the simulations with the RCMs for NARCCAP. These scientists worked with five of the six NARCCAP RCMs. We wanted the participants of this pilot study to be expert atmospheric scientists who were very familiar with regional climate modeling and its capabilities over North America. In-depth knowledge of the monsoon, specifically, was not required, though most indicated that they did have expertise in the climate of the region. (At the beginning of the presentation of Survey 2, an overview of the major climatological aspects of the region during the monsoon season was presented; see supplementary material 4.)
How the survey and exercises were conducted.
Participants were given two exercises to complete. The first was a survey (see supplementary material) that asked participants to describe the expertise they felt they had, both for North American subregions and for atmospheric processes. Four of the seven participants indicated that they had expertise in the climate of the relevant subregion. The first survey also asked participants about their attitudes (measured with a five-point Likert scale) toward the use of GCM and RCM simulations. Potential applications of model simulations ranged from interpreting likely future climate in a region to using the simulations to justify the allocation of funds for implementing adaptation plans. Participants were also asked how the combination of single versus multiple RCMs nested within single versus multiple GCMs would affect their beliefs about how climate model simulations should be used.
The second survey (see supplementary material) was administered immediately after the participants had filled out Survey 1. This survey was part of an exercise where the participants were to view a sequence of results from the NARCCAP RCMs in comparison with observations, and to evaluate the quality of the different models’ simulations of these variables. The RCMs were presented “blind,” where the participants were not primed in any way to associate the simulation outputs with any particular model. Five different variables were considered: 2-m temperature, precipitation, wind speed and direction, specific humidity, and moisture flux (a combination of winds and specific humidity). In addition, plots of the interannual variability of some of these variables were considered, and both annual average and average during the NAM season (June–September) were shown for temperature and precipitation. Other variables were restricted to the monsoon season. Figure 1 displays sample results for the variable moisture flux. (See the supplementary material for the full set of slides presented to participants.)
The entire exercise (including filling out Surveys 1 and 2) lasted about 3 h. There was also a discussion of the survey results for about 2 h the next day.
SURVEY RESULTS.
Survey 1.
We first summarize a couple of the key responses about how the participants viewed evaluation of RCMs (questions 9 and 10). Responses to question 9 on general evaluation were varied and included the importance of evaluating performance for all variables; evaluating temperature and precipitation first and then going on to other variables; evaluating bias in seasonal means, interannual variability, and extremes of temperature and precipitation; evaluating relationships between the variables; and primarily evaluating the spatial and temporal variability of precipitation. Thus, from the beginning we see different perspectives among the participants on what they look for in evaluating simulations. When considering the effect of purpose to which the simulations will be applied on approach to evaluation (question 10), two respondents indicated that the purpose made no difference, one had no opinion, and three felt that they would be more particular about the evaluation when the purpose was for adaptation planning. One also felt that more information on uncertainty would be required for use in adaptation contexts.
There was considerable consensus among the participants regarding the credibility of information about future regional climate under future radiative forcing based on projections from multiple RCMs nested in multiple GCMs. Given the participants’ involvement in NARCCAP, such consensus may constitute a foregone conclusion. Presumably, since they are participating in a multi-GCM-RCM program, they would see high value in multi-GCM-RCM results. However, we also asked how model outputs should be used for adaptation planning and implementation. Participation in NARCCAP would not necessarily presuppose attitudes about the appropriateness of using such projections for decision support. We nevertheless found that study participants were equally comfortable with results from multi-GCM-RCM projections being used for scientific research or adaptation planning (average score of 1.4, where 1.0 = “Strongly Agree”). However, in response to the statement, “Projections from a single RCM nested within a single GCM can provide credible information to decision makers about how funds should be allocated for implementing adaptation plans,” most study participants strongly disagreed (average score of 4.7, where 5.0 = “Strongly Disagree”).
The primary issue of contention is the reliance on simulations from a single RCM nested within a single GCM. Otherwise, the pattern of responses regarding credible information for decision-making (with respect to the allocation of funds for implementing adaptation plans) was such that more confidence was placed in simulations based on multiple RCMs nested within multiple GCMs. Nevertheless, there was one holdout, who believed that even this information would not be credible for allocation decisions. The pattern of responses suggests that, at the time of this survey, the regional modelers made a strong distinction between adaptation planning and adaptation implementation. In hindsight it might have been interesting to ask what information would be adequate for actually implementing adaptation plans, understanding that none of the modelers are experts in adaptation planning or implementation.
In contrast, statements that produced clear dissensus among the seven participants [where three answered on the side of (dis)agreement, while an equal number or the remainder answered on the opposing side] include:
“Projections from a single RCM nested within a single GCM can provide credible information about future regional climate under future radiative forcing” (four somewhat agreed, three somewhat/strongly disagreed)
“Projections from a single RCM nested within multiple GCMs can provide credible information to decision makers about how funds should be allocated for implementing adaptation plans” (three somewhat agreed, three somewhat/strongly disagreed, one reported no opinion)
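The distinction drawn above between consensus and dissensus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response lists are invented (they are not the study's data), the coding 1 = strongly agree through 5 = strongly disagree with 3 = no opinion is assumed, and the helper name `summarize` is hypothetical.

```python
from statistics import mean

# Assumed five-point coding: 1 = strongly agree, 2 = somewhat agree,
# 3 = no opinion, 4 = somewhat disagree, 5 = strongly disagree.
AGREE = {1, 2}
DISAGREE = {4, 5}

def summarize(responses):
    """Return the mean score and a split label for one survey statement."""
    n_agree = sum(r in AGREE for r in responses)
    n_disagree = sum(r in DISAGREE for r in responses)
    # Working rule for this sketch: "dissensus" means at least three
    # participants on each side of the statement.
    label = "dissensus" if min(n_agree, n_disagree) >= 3 else "no clear split"
    return {
        "mean": round(mean(responses), 2),
        "agree": n_agree,
        "disagree": n_disagree,
        "label": label,
    }

# Hypothetical responses from seven participants (NOT the study's data).
example_split = [2, 2, 2, 2, 4, 4, 5]
example_united = [1, 1, 1, 2, 1, 2, 2]

split_summary = summarize(example_split)
united_summary = summarize(example_united)
```

A mean score alone can hide a split: a statement with four participants agreeing and three disagreeing averages near the scale midpoint, which is why the sketch reports the counts on each side alongside the mean.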
Altogether, the survey participants generally reported that the most credible information for any use (for general information on future regional climate change, adaptation planning, or for resource allocation for adaptation implementation) would come from multiple RCMs nested in multiple GCMs. There was some tolerance for the credibility of information from one RCM nested within multiple GCMs, but only for projecting future regional climate change or for adaptation planning. However, there was a clear difference of opinion on whether projections from one RCM nested within one GCM would provide credible information—even when no policy planning or decisions would be made.
Survey 2.
For Survey 2, the participants evaluated the models based on their perception of how well the models reproduced five different variables. Ratings for each variable were based on a 1–10 scale, with 1 being very poor reproduction and 10 being excellent reproduction. This means that the highest possible total score for a model on each variable was 70, as all seven participants were involved in evaluating the variables. We were also interested to see, when summing the scores for evaluation of the individual variables, how much variation there was in the rankings of the seven participants. Only three of the participants were willing to provide an overall qualitative ranking of the models, even though all seven were willing to evaluate the models’ reproduction of individual variables. This divide in preferences for how to participate in model evaluation is interesting in and of itself. Four participants did not believe in ranking models. Of these four, one participant did not feel he/she had sufficient expertise for the NAM region to provide a meaningful evaluation, and another felt that the region was too small to be appropriately evaluated.
Table 2 provides the summary for the quantitative results regarding the evaluation of the variables of Survey 2. For most variables, there was no commonality across the models (e.g., that a model that did well in reproduction of temperature also did well in reproduction of precipitation). Only two models ranked highest for more than one variable: Model F for precipitation and specific humidity, and Model C for winds and moisture flux. Also, two different models ranked lowest for two variables: A, for precipitation and specific humidity, and B for winds and moisture flux. The best model based on the sum of variables score was C. It should be noted, however, that Models F and C had values quite close to one another. The lowest-ranked model was A but closely followed by Model B. Note, however, that there are only 10 percentage points between the highest and lowest ranked models.
Table 2. Rating and ranking results.
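The scoring arithmetic described above (seven ratings on a 1–10 scale summed per variable, then summed across variables and compared as a percentage of the maximum) can be illustrated with a short Python sketch. The models X, Y, and Z, the two variables, and all ratings below are invented for illustration and are not the values in Table 2; the function names are likewise hypothetical.

```python
MAX_RATING = 10  # top of the 1-10 rating scale

# Hypothetical data: ratings[variable][model] holds one rating per
# participant (seven each). These are NOT the study's ratings.
ratings = {
    "precipitation": {
        "X": [6, 7, 5, 6, 7, 6, 5],
        "Y": [4, 5, 4, 3, 5, 4, 4],
        "Z": [7, 6, 6, 7, 6, 5, 6],
    },
    "moisture_flux": {
        "X": [5, 6, 6, 5, 6, 5, 6],
        "Y": [6, 5, 6, 6, 5, 6, 5],
        "Z": [6, 7, 7, 6, 7, 6, 7],
    },
}

def variable_totals(ratings):
    """Sum the participants' ratings per variable and model (max 70 each)."""
    return {
        var: {model: sum(scores) for model, scores in by_model.items()}
        for var, by_model in ratings.items()
    }

def overall_percent(ratings):
    """Express each model's summed score as a percentage of the maximum."""
    earned, possible = {}, {}
    for by_model in ratings.values():
        for model, scores in by_model.items():
            earned[model] = earned.get(model, 0) + sum(scores)
            possible[model] = possible.get(model, 0) + MAX_RATING * len(scores)
    return {model: 100 * earned[model] / possible[model] for model in earned}

totals = variable_totals(ratings)
percent = overall_percent(ratings)
# Models ordered from highest to lowest overall percentage.
ranking = sorted(percent, key=percent.get, reverse=True)
# Gap in percentage points between the best and worst model.
spread = max(percent.values()) - min(percent.values())
```

Summing across variables in this way weights every variable equally, which is the choice made for the combined variable evaluation in this study.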
Because of the very small sample size (3) for the overall qualitative EJ (QEJ) analysis, we do not compare those results quantitatively with the individual variable analysis of the three participants. We do provide a few observations about these results, however. For the overall QEJ approach, the participants also used a rating from 1 to 10, but the range of rankings for the QEJ approach is very narrow, with all participants using only three or four different values. Thus, a number of models are given the same ranking. Interestingly, one of the models selected as the best by one participant (Model B) was selected as one of the worst by the other two participants, indicating that each considered different biases to be more or less important. While this result can only be suggestive, it does indicate that further experiments with a larger collection of participants may be worthwhile.
Several participants in the discussion the next day expressed surprise that Model A was rated so low under both systems of evaluation since this model ranked relatively high for both 10-m winds and moisture flux. In the combined variable evaluation, we weighted the variables equally for the summing, whereas some of the participants commented that getting the correct wind reversal going up the Gulf of California should be viewed as the quintessential characteristic of the monsoon and should perhaps be weighted more heavily. This EJ regarding which biases are most important for future climate projections in this region is also reflected in the wide variety of responses to question 6 in Survey 1. Some felt that an in-depth analysis of monsoon system processes and features was most important (as was later completed, see For Further Reading), another that winds and soil moisture in summer as well as the thermal low were most important, while another responded that moisture flux in JJAS was most important.
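To make the weighting point concrete, the following Python sketch shows how the best-rated model can change when one variable is weighted more heavily than another. The models P and Q, their per-variable totals, and the weights are all hypothetical; the study itself used equal weights and did not prescribe any alternative weighting.

```python
def weighted_scores(totals, weights):
    """Weighted average of per-variable totals for each model."""
    weight_sum = sum(weights.values())
    return {
        model: sum(weights[var] * score for var, score in by_var.items()) / weight_sum
        for model, by_var in totals.items()
    }

# Hypothetical per-variable totals (out of 70) for two invented models.
totals_by_model = {
    "P": {"precipitation": 50, "winds": 30},
    "Q": {"precipitation": 35, "winds": 42},
}

# Equal weighting, as in the combined variable evaluation.
equal = weighted_scores(totals_by_model, {"precipitation": 1, "winds": 1})
# An assumed alternative that counts winds three times as heavily.
wind_heavy = weighted_scores(totals_by_model, {"precipitation": 1, "winds": 3})

best_equal = max(equal, key=equal.get)
best_wind_heavy = max(wind_heavy, key=wind_heavy.get)
```

With these invented numbers, model P leads under equal weights and model Q leads once winds are emphasized, which is the kind of reversal participants had in mind when they argued that the wind reversal in the Gulf of California deserves extra weight.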
EXPERT JUDGMENT VERSUS PERFORMANCE METRICS.
The results of the expert judgment contrast somewhat with the ranking of the RCMs produced using a set of generally applicable quality metrics that were initially developed for weighting the European ENSEMBLES set of RCM simulations (see For Further Reading). This set consists of five metrics that measure an RCM’s ability to reproduce different precipitation and temperature statistics, in terms of mesoscale spatial pattern, extremes, daily probability distributions, trends, and annual cycle, and one metric measuring their ability to capture large-scale weather regimes. The large-scale metric will be disregarded here, as it is calculated for all of North America and not just the Southwest region. For the southwest United States in summer, the metrics and the experts agree at the high and low ends of the rankings, as Model F receives the most weight and is highly rated by the experts overall, and Model A is weighted and rated lowest. However, Model B is weighted second-highest using the metrics (followed by Models E, C, and D), but is down-weighted by the participants due to its inability to capture some important regional processes. The metrics, however, do not include regionally specific characteristics (e.g., moisture flux from the Gulf of California), as the metrics are meant to be useful everywhere. This highlights the potential usefulness of experts in determining credibility over bulk metrics, or for determining a set of regionally specific metrics for use in weighting.
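One simple way to quantify agreement between two rankings of this kind is a Spearman rank correlation, sketched below in Python. The metric ordering (F, B, E, C, D, A) follows the text above. The expert ordering is only partly known from the text (C and F at the top, B and A at the bottom), so the positions of Models D and E in `expert_order` are placeholders assumed for illustration, and the resulting coefficient should not be read as a result of the study. The function names are hypothetical.

```python
def ranks(order):
    """Map each model to its 1-based position in a best-to-worst ordering."""
    return {model: i + 1 for i, model in enumerate(order)}

def spearman_rho(order_a, order_b):
    """Spearman rank correlation for two orderings without ties."""
    ra, rb = ranks(order_a), ranks(order_b)
    n = len(ra)
    d_squared = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def rank_gaps(order_a, order_b):
    """Rank in order_b minus rank in order_a; positive means demoted in b."""
    ra, rb = ranks(order_a), ranks(order_b)
    return {m: rb[m] - ra[m] for m in ra}

# Ordering by metric weight, as stated in the text.
metric_order = ["F", "B", "E", "C", "D", "A"]
# Expert ordering: C, F, B, and A follow the text; the placement of
# D and E is an ASSUMPTION made only so the example can run.
expert_order = ["C", "F", "D", "E", "B", "A"]

rho = spearman_rho(metric_order, expert_order)
gaps = rank_gaps(metric_order, expert_order)
```

The per-model gaps are arguably more informative than the single coefficient here, since they point to the specific models (such as Model B) on which the general-purpose metrics and the experts diverge.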
The results of the EJ are more in line with the results of a more in-depth study of the simulations regarding the North American Monsoon (see “For Further Reading”) conducted later, but there are some differences regarding which RCMs perform best. Specifically, Models C and F are identified as the better simulations in the in-depth study, as well as Model A, as they better simulate some of the monsoon’s defining flow features. The relative importance of the ability of the models to reproduce different processes is an EJ issue in and of itself, and clearly some of the differences in the final evaluation by the three regional modelers were based on differing perspectives regarding how important different biases were to them.
CONCLUSIONS AND GOING FORWARD.
In the discussion period the next day, when the results of the EJ were shown, the participants expressed high interest in the results and in the differences in their perspectives. In these discussions, some suggestions were also made for how this type of exercise could be improved (see supplementary material). Ultimately, most felt that the greatest value in the exercise was going through the process rather than the specific results that were obtained. For example, the participants benefited from the discussions regarding the relative importance of different variables and overall model evaluation, learned more about how each thought about RCM evaluation, and considered more deeply the role of evaluation in the various possible uses of results of simulations.
The most interesting conclusion from this work is that the various participants had different perceptions of the quality of the simulations, based on the individual ratings of the variables and in the overall ratings of the RCMs. It certainly was surprising that among the three participants who were willing to give an overall model evaluation, some simulations were ranked very differently. It was clear that this partially resulted from conscious or unconscious differential “weighting” of the variables or combinations of variables in the overall evaluation.
This study suggests that research focusing on developing general-use metrics is incomplete. It appears to be important to apply differential weightings to metrics due to important regional processes. Unless there is community guidance on what appropriate differential weights for different phenomena should be, scientists will subjectively determine appropriate weights for themselves. That scientists already do this, and that such subjective weightings can lead to different model evaluations, came as a surprise to our participants. Thus, further EJ studies of this type may be useful for isolating particular metrics that may be most appropriate for different regional phenomena as well as to determine the most appropriate weights. Helping the scientific community better understand and communicate their different weightings could better direct future research on the improvement of climate models as well as increase confidence in their use for making regional adaptation policy decisions. Finally, we think our preliminary results warrant fuller exploration of EJ approaches within the context of NARCCAP and other regional climate modeling programs such as the Coordinated Regional Climate Downscaling Experiment (CORDEX).
ACKNOWLEDGMENTS
We thank the seven individuals who participated in this study, as well as Seth McGinnis, who took detailed notes during the exercise. We also thank the two anonymous reviewers who made many useful suggestions for improving the manuscript.
FOR FURTHER READING
Adams, D. K., and A. C. Comrie, 1997: The North American Monsoon. Bull. Amer. Meteor. Soc., 78, 2197–2213, doi:10.1175/1520-0477(1997)078<2197:TNAM>2.0.CO;2.
Aspinall, W., 2010: A route to more tractable expert advice. Nature, 463, 294–295, doi:10.1038/463294a.
Bukovsky, M., D. Gochis, and L. O. Mearns, 2013: Towards establishing NARCCAP regional climate model credibility for the North American Monsoon: Current Simulations. J. Climate, 26, 8802–8826, doi:10.1175/JCLI-D-12-00538.1.
Bukovsky, M., J. Thompson, and L. O. Mearns, 2013: The effect of weighting on the NARCCAP ensemble mean. 25th Conf. on Climate Variability and Change, Austin, TX, P106.
Bukovsky, M., J. Thompson, and L. O. Mearns, 2016: The effect of weighting on the NARCCAP ensemble mean: Does it make a difference? Climate Res., to be submitted.
Christensen, J. H., E. Kjellström, F. Giorgi, G. Lenderink, and M. Rummukainen, 2010: Weight assessment in regional climate models. Climate Res., 44, 179–194, doi:10.3354/cr00916.
Flato, G., and Coauthors, 2013: Evaluation of climate models. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 741–866.
Mearns, L. O., and Coauthors, 2012: The North American Regional Climate Change Assessment Program: Overview of phase I results. Bull. Amer. Meteor. Soc., 93, 1337–1362, doi:10.1175/BAMS-D-11-00223.1.
Mearns, L. O., M. Bukovsky, S. Pryor, and V. Magaña, 2014: Downscaling of climate information. Climate Change in North America, G. Ohring, Ed., Springer, 201–250.
Morgan, M. G., 2014: Use (and abuse) of expert elicitation in support of decision making for public policy. Proc. Natl. Acad. Sci. USA, 111, 7176–7184, doi:10.1073/pnas.1319946111.