Climate modeling groups from four continents have submitted simulations as part of phase 5 of the Coupled Model Intercomparison Project (CMIP5). With climate impact assessment in mind, we test the accuracy of the seasonal averages of temperature, precipitation, and mean sea level pressure, compared to two observational datasets. Nondimensional skill scores have been generated for the global land and six continental domains. For most cases the 25 models analyzed perform well, particularly the models from Europe. Overall, this CMIP5 ensemble shows improved skill over the earlier (ca. 2005) CMIP3 ensemble of 24 models. This improvement is seen for each variable and continent, and in each case it is largely consistent with the increased resolution on average of CMIP5, given the correlation between scores and grid length found across the combined ensemble. From this apparent influence on skill, the smaller average score for the 13 Earth system models in CMIP5 is consistent with their mostly lower resolution. There is some variation in the ranking of models by skill score for the global, versus continental, measures of skill, and this prompts consideration of the potential influence of a regional focus that model developers might have. While some models rank considerably better in their “home” continent than globally, most have similar ranks in the two domains. Averaging over each ensemble, the home rank is better by only one or two ranks, indicating that the location of development is only a minor influence.
Simple skill scores for CMIP5 and CMIP3 climate models are analyzed for each continent and the globe.
Projections of climate changes resulting from past and anticipated anthropogenic forcings, such as those from Meehl et al. (2007a), have been largely based on simulations by comprehensive models of the coupled ocean–atmosphere–land system, known as global climate models (GCMs). Phase 5 of the Coupled Model Intercomparison Project (CMIP5), whose protocol is described by Taylor et al. (2012), is the latest set of experiments in which climate modeling groups worldwide have participated and for which their results have been publicly distributed. Both CMIP5 and the earlier phase, CMIP3, include experiments for the coming century under various scenarios as well as “historical” (1850–2005) simulations (or runs), with best estimates of historical forcing.
Together with other authors, Randall et al. (2007) evaluated the CMIP3 climate models with the climate change application in mind. They made an extensive comparison of the simulations with observations from the past few decades, including processes such as those that control climate sensitivity to forcing, but also basic features of the climate system like temperatures and circulations in the atmosphere and ocean. Applying objective metrics to the comparisons of such data, Gleckler et al. (2008) found a considerable range of skill across the CMIP3 ensemble in representing atmospheric and surface features over the globe. Reichler and Kim (2008) accumulated skill scores into a single metric, based on which it was argued that model performance has improved over successive CMIP phases.
Watterson (2008) also provided an overall skill score for 23 CMIP3 models in the context of projections of climate change over Australia. The analysis focused on a small set of variables related to climate impact—surface air temperature (tas in the CMIP5 nomenclature), mean sea level pressure (psl), and precipitation (pr)—and on means over each of the four seasons [December–February (DJF), March–May (MAM), June–August (JJA), and September–November (SON)]. Whetton et al. (2007) showed that these skills measured regionally had some bearing on the regional changes in various continents. Subsequently, Watterson and Whetton (2011) applied the same skill tests to a global land domain. The better performing models largely matched those of Reichler and Kim (2008).
During the development of a new coupled model, ACCESS (see Table 1 for expanded model names), skill scores were used to demonstrate that the model performs better over both Australia and the globe than the previous Australian model, CSIRO Mk3.5 (see Bi et al. 2013). Naturally, the skills of all models over Australia are of considerable interest to those undertaking climate projections for Australia, and one aim of this study is to extend this type of assessment to all the continents (excluding Antarctica), as well as for the global land. Continental domains also feature in the Coordinated Regional Climate Downscaling Experiment (CORDEX), which is linked to CMIP5 (see Jones et al. 2011).
What factors influence the variation in skill of the present climate among models? Randall et al. (2007) considered the many improvements in model formulation over time. Improved simulation of energy fluxes has allowed most recent models, including all those in CMIP5, to avoid the need for artificial flux adjustments. Treatments for some biogeochemical processes that control the future concentrations of some important greenhouse gases are now included in some CMIP5 models, which are denoted as Earth system (ES) models to distinguish them from atmosphere–ocean (AO) models in which concentrations are specified. However, as discussed by Gleckler et al. (2008), flux adjustments and other constraints can produce a superficial improvement in present climate skill. Both Gleckler et al. (2008) and Randall et al. (2007) noted the improved resolution of models, as computing power has increased. Increasing resolution of an individual model tends to improve skill (see, e.g., Pope and Stratton 2002), particularly in extremes of rainfall (Wehner et al. 2010) and wind speed, among other quantities (Kinter et al. 2013). Reichler and Kim (2008) attributed some of the climatological improvement over successive CMIP phases to resolution, but did not quantify this effect. We make an assessment of the extent that horizontal resolution relates to the skill for three variables across the combined CMIP3 and CMIP5 ensemble. We also compare the ES and AO categories of CMIP5.
Consideration of a further potential influence on the scores within the continents was spurred by a particular interest in regional performance. Alluding to the “home advantage” often enjoyed by sporting teams, do global models tend to perform better in their “home” continent—that is, where they were developed? As Jakob (2010) describes the process, it would be normal for model developers to have improved performance for the intended application in mind. It is conceivable that merely by having a regional focus in evaluation (e.g., Chylek et al. 2011; Scaife et al. 2011; Watterson et al. 2013), it would be less likely that a code change or a “tuning” of parameters (Mauritsen et al. 2012) would be made that is to the detriment of the home performance. A thorough consideration of the issue might address the development approach of each modeling group, and even their aims and motivations. The scope here is a much more limited assessment based on the continental skill scores, partitioned by the four continents from which models have been submitted to CMIP5 and CMIP3. Specifically, we compare the ranking of models for their home with their global rank. Some of the detailed results, especially for CMIP3 models, are shown in the online supplement.
DATASETS AND METHODS.
CMIP5 includes a range of climate experiments (Taylor et al. 2012) and has attracted submissions from over two dozen modeling groups in at least 13 countries (see cmip-pcmdi.llnl.gov/cmip5 for details). Our focus is on global coupled models used in both historical and long-term climate experiments, from which suitable data were available on the National Computational Infrastructure (NCI, Australia) data portal in early 2013. This produced an ensemble of 25 models, as listed in Table 1 (with the order to be explained later). The horizontal resolution of each model is included, as a single, representative gridbox length for the land surface. This length is defined as the square root of the quotient of the Earth's surface area and the number of points on the model's data grid. Of the 25 models, 13 are taken to be Earth system models (ESMs). However, the ES components may have only a limited effect on the standard historical runs, in which the greenhouse gas concentrations are specified from observational data. Some groups have produced two or more versions of a model, sometimes with a reduction in resolution for the ES cases (e.g., the MIROC group from Japan).
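As a sketch, the representative gridbox length defined above can be computed from the grid dimensions alone; the function name and the example grid counts here are illustrative, not taken from the study's tables:

```python
import math

# Approximate surface area of Earth in km^2
EARTH_SURFACE_AREA_KM2 = 510.1e6

def representative_grid_length(n_lon: int, n_lat: int) -> float:
    """Representative gridbox length (km): the square root of Earth's
    surface area divided by the number of points on the data grid."""
    n_points = n_lon * n_lat
    return math.sqrt(EARTH_SURFACE_AREA_KM2 / n_points)

# A hypothetical 192 x 145 grid gives a length of roughly 135 km
print(round(representative_grid_length(192, 145)))
```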
Within each model category, along with differing resolution, there are often substantial differences in the components—for instance, ACCESS1.3 has a different land surface and different representation of clouds and some other atmospheric processes as compared to ACCESS1.0 (Bi et al. 2013). There are also common features, such as shared atmospheric or ocean structures. For instance, ACCESS includes versions of the HadGEM atmospheric model developed by the Met Office (Brown et al. 2012), and used in HadGEM2-ES (and HadGEM2-CC; Collins et al. 2011). Even then, the final development of each coupled system may lead to differing regional performance, and the location of the submitting group is potentially important here, rather than that of the developers of the original components. Of these 25 models, 8 are from Europe, 7 each from Asia and North America, and 3 from Australia. In assessing the influence of resolution, it is interesting to consider also MIROC, version 4h (high resolution; MIROC4h; Sakamoto et al. 2012), with a grid length of only 50 km; however, it is not available for extended simulations.
The skill test is limited to a comparison of time-averaged, or climatological, gridded fields of tas, pr, and psl, for the four seasons (DJF, MAM, JJA, and SON). Each historical simulation extends to the nominal year 2005, and averages over the 30-yr period of 1975–2004 have been constructed, with the December–February season extending to February 2005. In these simulations there is no relationship of the weather and unforced variability, even on interdecadal time scales, to the observed variability, so the choice of period for the datasets is not crucial. For the CSIRO Mk3.6 model (Rotstayn et al. 2012), the first of a set of 10 runs was used, with the others analyzed to indicate effects of unforced variability on the results (which are relatively small).
The CMIP3 dataset includes simulations of the twentieth-century climate (20C3M) by coupled models current during 2005/06 (Meehl et al. 2007b) from an ensemble of 24 models (see the supplement, and Randall et al. 2007, for references). We use the same climatological data as Watterson (2008), which were averages over the standard climate period 1961–90.
In our previous CMIP3 global analysis, the observational data for tas and pr were from the Climatic Research Unit (CRU, United Kingdom) land dataset for 1961–90 (New et al. 1999) aggregated to a 2° grid, which includes data for numerous islands (see the supplement). For psl the 40-yr European Centre for Medium-Range Weather Forecasts (ECMWF) Re-Analyses (ERA-40) (Uppala et al. 2005), on a 2.5° grid, were used (after interpolation to the CRU grid). This set will be referred to as Data CRU, or simply CRU. The representative length of the CRU grid is 177 km, while for the 2.5° grid it is 220 km.
The sensitivity of our skill metrics to the choice of the observational data is evaluated through the use of the current reanalysis, the Interim ECMWF Re-Analysis (ERA-Interim; Dee et al. 2011), as a second dataset. Using this for all three variables allows the assessment of the relationship with resolution to be made consistently across the three, and the 1.5° data grid, with the representative length of 133 km, is of similar resolution to the higher-resolution models in CMIP5 [Gleckler et al. (2008) note that scores for pr, especially, can vary with the grid]. Climatologies constructed from the 30 years through February 2009 are used here, and denoted as Data ERA or ERA. Points that are designated land are used, as shown in Fig. 1. As a reanalysis, these data can depend on the model used to optimally estimate the global state at each time given observations. ERA-Interim precipitation is generated by the model, but Simmons et al. (2010) show that it compares well with standard alternative datasets over the continents. To explore this we have made a further assessment using pr from the Global Precipitation Climatology Project, version 2.1 (GPCPv2.1; available on a 2.5° grid; Adler et al. 2003), using averages over 1979–2008. We also considered psl from the National Centers for Environmental Prediction (NCEP) reanalysis (Kistler et al. 2001) over 1973–2002.
For our tests we exclude Antarctica and any land south of 60°S, as these would have little bearing on our aims. The remaining land points form the “global land” domain (GL), as shown for Data ERA in Fig. 1. We form six continental domains using simple latitude and longitude boundaries, hence including some islands, and shown with coded colors. To provide some limit to the domains that might be influenced by home advantage, some land is excluded from these six continents (shaded gray). The two-letter code for each continent is marked in Fig. 1. The home continent of each model is indicated by the model code names given in Table 1, with the CMIP5 North American models referred to as NA1–NA7, for instance.
As before, the main metric is the nondimensional arcsin Mielke measure M (see Meehl et al. 2007a for further uses). In essence it is the mean square error (mse), nondimensionalized by the spatial variance of the field. Specifically, for the mse between the model field X and observed field Y,

M = {(2/π) arcsin[1 − mse/(V_X + V_Y + (G_X − G_Y)²)]} × 1000,
where V is the variance and G is the mean, with all statistics calculated over the domain, after X was interpolated to the Y grid. The final factor provides a skill score that has a maximum possible value (for mse = 0) of 1000 “points,” while a zero score indicates no skill (negative values are rare in our context). If the variance and mean of X and Y are the same (and they are usually close here), then M/1000 is equivalent to the standard correlation coefficient r, except that the arcsin transformation (applied to a measure of Mielke 1991) means that the deviation from unity asymptotes to the square root of mse, rather than mse itself. This is useful for fields like tas, where r tends to be close to one.
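A minimal Python sketch of this measure follows from the definition above; for simplicity it omits the interpolation to the observation grid and any area weighting, and the function name is ours:

```python
import numpy as np

def mielke_m(x: np.ndarray, y: np.ndarray) -> float:
    """Arcsin Mielke measure M, in 'points' (maximum 1000 for mse = 0).

    x: model field, already interpolated to the observation grid
    y: observed field of the same shape (NaNs are excluded jointly)
    """
    valid = ~(np.isnan(x) | np.isnan(y))
    x, y = x[valid], y[valid]
    mse = np.mean((x - y) ** 2)
    # V is the spatial variance and G the mean, over the domain
    denom = np.var(x) + np.var(y) + (np.mean(x) - np.mean(y)) ** 2
    return 1000.0 * (2.0 / np.pi) * np.arcsin(1.0 - mse / denom)

# Perfect agreement (mse = 0) gives the maximum score
rng = np.random.default_rng(0)
field = rng.normal(size=(36, 72))
print(round(mielke_m(field, field)))  # 1000
```

When the variance and mean of x and y match, the quantity inside the arcsin reduces to the spatial correlation, consistent with the equivalence to r noted above.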
As an example, the similarity between the two observational datasets, determined on the Fig. 1 grid, and averaged over four seasons, is quantified by M ranging from 920 for tas to 670 for pr, with the three-variable average of 830. Despite the expected greenhouse gas–forced change over the 14-yr shift in period between Data CRU and (the later) Data ERA being a small warming, the differences between the tas data are smaller than those for psl and pr (relative to their spatial variance). It is evident that there are systematic differences between the datasets. To sample the corresponding range in skill scores, we use both CRU and ERA, with some focus on results for CMIP5 using ERA. Consistent with the term rmse, we describe the magnitude of the difference between model and observational values as the model “error.” The shading on the map (Fig. 1) indicates where these errors tend to be modest. At each grid point, the errors (for ERA) have been averaged over the ensemble and the four seasons. The errors for precipitation tend to be large compared to the spatial average error (in terms of mm day−1) in high rainfall regions. The tas and psl errors tend to be large at high latitudes and over high orography. A criterion of a value less than 110% of the GL domain average for each quantity is used for the shading. Evidently midlatitude regions tend to be better modeled, in this sense. A similar pattern of shading applies for Data CRU and for the CMIP3 ensemble.
For each CMIP5 model, the metric for agreement with observations has been calculated for the seasonal climatologies of tas, pr, and psl initially over the GL domain, producing the scores given in Table 1. The values for Data ERA range from 596 to 755 points, which indicates that all models have a substantial degree of skill. The average step between the ordered values is 7 points, although some scores are closer. The standard deviation of the 10 scores from all runs of CSIRO Mk3.6 is only 1.5 points, which suggests that uncertainty in the scores due to unforced variability is usually of little importance.
The scores for CRU (Table 1) are typically a little higher, partly as a result of the additional island grid points and the reduced resolution of the data grid, as discussed shortly. The corresponding scores for the CMIP3 models range from 521 to 749, a larger range than for CMIP5 overall.
The average scores for all four cases are given in Table 2. The CMIP5 average is 24 points higher than CMIP3 for ERA, and 17 higher for CRU. The combined average change (Table 2) indicates a small improvement (20 points) of overall skill in CMIP5 over CMIP3, but it amounts to only about three model ranks. There are usually improvements between successive models from the same modeling group, and the step for GISS-E2-H (see www.giss.nasa.gov/tools/modelE) is over 100 points. Substantial improvements were also achieved by CNRM-CM5 (Voldoire et al. 2013), GISS-E2-R, and CCSM4 (Gent et al. 2011). However, the higher scores from CMIP3 have not been greatly improved on.
The M values in Table 1 correspond to a typical rmse for the best-performing CMIP5 model being around 40% smaller than that from the least. The mean errors for each variable (used in Fig. 1 shading) and each dataset are also reduced in CMIP5, with the exception of pr and CRU, for which the error is 1% larger than in CMIP3.
We turn to the values calculated for each of the continental domains. For Data ERA, the individual values for four domains are plotted in Fig. 2, where the points are positioned horizontally by the corresponding GL score. For clarity, values for South America (SA) and Africa (AF) are not shown in Fig. 2, but the labeled results for each continent are shown separately in the supplement. In Fig. 2, the red points are the scores for (simplified) North America (NA), for instance, and these all lie under the Y = X line shown, as the NA scores are systematically smaller than those for GL. Scores for the other continents are also typically smaller than for GL. One reason for this is that the spatial variation, which is included as a divisor to mse in the calculation of M, tends to be larger for the larger domain. It is worth noting that the only negative M values (for two models) were for the case of psl in June–August over NA, in which the spatial variation is relatively small. A similar comparison between (three-variable average) GL and continental scores holds for CMIP3.
The average scores over the ensembles for each continent are given in Table 2. For each case the continental scores are smaller than the corresponding GL value. The results also differ between continents, with the CMIP5 scores being lowest for SA and highest for Australia (AU). While the spatial variation across each domain is a factor, the difference is partly associated with the difference in mean errors as evident from the shading in Fig. 1. For each continent CMIP5 improves on CMIP3, often considerably, as in the combined result of 58 points for AU (Table 2).
All these scores are averages across the three variables, with the individual results for GL given in the supplement. As a result of the mse being small relative to the spatial range of tas (which is largely driven by solar radiation in the GL case), the mean score for tas alone is high (859 for ERA) compared to that for psl (611) and pr (562). The range of scores across the models is, however, smallest for tas. This is despite some considerable differences between the GL mean values of annual mean tas, which range over 4 K. Typically, rmse is not greatly reduced if the domain means are first removed. Consideration of scores for each continent indicates that the variation for tas tends to be more similar to that for pr and psl over these smaller domains.
The average of the GL scores for the third observational set for precipitation, GPCP (which is first interpolated to the ERA grid), is a little higher (by 30 points) than for ERA, which seems consistent with the sensitivity of scores for pr to the resolution of the data noted by Gleckler et al. (2008). The individual scores are given in the supplement. The extremes of pr from ERA are over 32 mm day−1 in each season, while those of GPCP are under 14 mm day−1, and more typical of peaks from the lower-resolution models. Nevertheless, the series of 25 scores for ERA correlate with those from GPCP well, with r = 0.88, so the choice of dataset makes little difference to the ranking of models in this case.
For more general comparisons, it is worthwhile combining the scores from all the domains, and for simplicity we take a simple seven-score average, including GL with the six continents. The correlation between the ERA pr scores and those for GPCP rises to 0.92, which is similar to the agreement between the Data CRU and ERA scores for tas and psl.
It is worth noting that, as found by Reichler and Kim (2008), the “multimodel mean,” where each score is calculated using the average of the 25 model fields (interpolated to the data grid), performs better than any individual model in the GL case (Table 1). The values are 792 for CRU and 776 for ERA. This superiority holds for the continents using CRU (excepting NA). However, for ERA, it holds only for AU, with the mean falling short of the top model by as much as 56 points, for SA, evidently as a result of larger regional differences (relative to ERA) in some models.
The top-ranked model in CMIP5, by the average of M scores for Data ERA, is indicated for each domain in Table 2. The models from Europe (EU) prevail in each case except South America, but four different models are represented. The top-ranked model for CRU is the same (as Table 2) for each domain, except for AU, where AU1 is first, and EU, with EU4 leading. After averaging over the seven domains, CNRM-CM5 remains on top for CRU, but for ERA its score is exceeded by MPI-ESM-LR (as given in Table 3, under the column head Avg). The overall average of the values from the seven domains and two datasets has ultimately been used to give the order for CMIP5 in Table 1. The values range from 558 to 696 points. The number in the code is based on the ranks within the partition of models by home continent. Based on this order, European models take four of the top five ranks. The success of the HadGEM-based models (AU1, EU3, EU4, and AU2) is also evident. A 14-score average has likewise been applied to the CMIP3 models, with a range from 432 to 683 points. This ranking of the CMIP3 models is similar to that of Reichler and Kim (2008). As for CMIP5, European models take the top two places. The overall scores for models from both ensembles are featured in Fig. 3, along with the model grid length.
The ensemble mean of the overall scores, for each variable and the average, is given in Table 3, for both CMIP5 and CMIP3. The skill for each of the variables is improved in CMIP5 (see Table 3), on average by 31 points. The smaller range of (three-variable average) scores in this CMIP5 ensemble does, however, result in the increase in the median value, over CMIP3, being smaller, by some 19 points. The top-ranked models for each variable (for Data ERA, in Table 3) again feature European models.
INFLUENCES ON MODEL SKILL.
What can be deduced from this limited analysis about characteristics of models that lead to higher skill (for the present climate), and that have led to a modest improvement in CMIP5 over CMIP3? Consideration of the maps in Fig. 1 and the supplement shows that errors relative to the observational data tend to be larger over higher orography. This is particularly evident in results for psl, the pressure extrapolated to mean sea level, notably over the Tibetan Plateau during the June–August season. The CNRM-CM5 model (or EU1) does relatively well there, and it attains its top overall ranking through substantially better scores than other models for psl over the GL and Asia (AS) domains (where it is ranked first). Using the NCEP psl climatology produced very similar scores. The superiority of EU1 for psl does not extend to the global ocean domain, however. Given that over land an extrapolation to sea level is required, a potential contributor to the success of EU1 is that it uses a method that is more akin to that used for the ERA and NCEP reanalyses. Another apparently systematic difference between models occurs for tas over the Amazon and other tropical rain forests. One way individual models might score better would be for improvements to be made in the way their modeled fields are determined in such situations, with respect to those of the observational data. This might be specific only to the dataset, however.
The partition of CMIP5 models into ES and AO categories (see Table 1) is of some interest, although the present climate simulation of tas, pr, and psl is not expected to be sensitive to the inclusion of Earth system components. This is evident from the similarity of scores among the HadGEM-based models (the two ES versions, in comparison to the two ACCESS AO models). It is initially surprising then to find that the average of the overall scores for the ES models is 608, some 22 points lower than for the AO models, with the values shown in Fig. 3. A similar contrast occurs for each variable (the plots are in the supplement). Considering only the AO models, the improvement in CMIP5 over CMIP3 (all in the AO category) is more substantial (at 43 points).
Five CMIP3 models included some flux adjustment, which likely improves their present climate. However, the average overall score for the five is 586, virtually the same as for the full set (Table 3).
Examination of Table 1 and Fig. 3 suggests that differing resolution among the ensembles and categories is a factor in these comparisons. The average grid length for our 25 CMIP5 models is 180 km, in contrast to the 242-km average for CMIP3. There is no established way to link skill scores and model resolution in general, but we have considered several alternatives using linear regression applied to the M-based skill scores and powers of grid length (such as the square, giving area). The correlation is consistently highest (in magnitude) simply for the length itself. Values of r for our 25-member CMIP5 ensemble and Data ERA are given in Table 2, for each domain, and in Table 3, for each variable. In all but the AF case, the correlation between the skill score and grid length is above 0.4 in magnitude, and it is highly unlikely to occur by chance. The correlation for pr is −0.72. Using rmse as the skill metric also produces correlations above 0.4 for each variable.
Using the combined Data CRU and ERA skill scores, and the combined set of 49 models, we again find a consistent relationship between score and grid length, as seen in Tables 2 and 3. The overall score produces the largest magnitude of r, −0.71. The regression line giving the corresponding linear relationship between the score and length is shown in Fig. 3. The averages for each category, as well as for the 25 CMIP5 models (not shown), all lie very close to the line. Using this linear fit, we would expect that the reduction in length for CMIP5 would correspond to a boost in score of 30 points, almost matching the actual value. The ratios (as percentages) for this case and others are included in the tables. There is some variation around 100%, but for each domain and variable, it seems that much or all of the improvement in CMIP5 on average relates to the improvement in resolution of the models. Recall that in calculating the M values, each model field was first interpolated to the CRU or ERA grid, largely removing a direct effect of the analysis grid. This was confirmed by similar correlations holding for scores for CRU and ERA, individually.
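The regression approach described above can be sketched as follows; the grid-length and score values are invented stand-ins for the combined ensemble (the real values come from the study's tables), so only the method is illustrated:

```python
import numpy as np

# Hypothetical (grid length, skill score) pairs standing in for the ensemble
grid_length_km = np.array([110.0, 140.0, 165.0, 180.0, 210.0, 250.0, 300.0])
score_points = np.array([690.0, 665.0, 650.0, 640.0, 610.0, 580.0, 540.0])

# Correlation between score and grid length (negative: coarser grids score lower)
r = np.corrcoef(grid_length_km, score_points)[0, 1]

# Least-squares line, score = a * length + b, as fitted for Fig. 3
a, b = np.polyfit(grid_length_km, score_points, 1)

# Expected boost in score from refining the average grid length
# from 242 km (CMIP3) to 180 km (CMIP5)
boost = a * (180.0 - 242.0)
print(f"r = {r:.2f}, slope = {a:.2f} points/km, boost = {boost:.0f} points")
```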
Scores for the additional, high-resolution model MIROC4h are also included in these scatterplots, standing out at the far left. While not the top-scoring model overall, its point in Fig. 3 lies on the regression line, and hence it is an independent result in support of an influence of resolution. The MIROC4h score for pr is the highest, in part because the model has the highest average (seven domains) for pr from the higher-resolution Data ERA. It is one of few models that has a higher score for that pr climatology than it has for GPCP, and it also has a lower rmse for ERA (this holds for data interpolated to the GPCP grid also). While MIROC4h is evidently not the most realistic model for rainfall that is aggregated over larger regions, its resolution enables it to match the greater spatial variation of ERA.
It can be seen from the average values in these scatterplots that the difference between the average scores for the ES and AO categories of CMIP5 also relates to the difference in resolution. The average length for the AO models is 161 km, while that for ES is 198 km. Presumably, some groups have had to compromise with resolution for their ES model, due to the added computational cost of the ES components. For instance, the MPI-ESM-LR model (Giorgetta et al. 2013) has the same horizontal grid as the MPI group's CMIP3 model (an AO model), and its skill score is only 12 points higher. Some groups [notably CNRM, GISS, and National Center for Atmospheric Research (NCAR)/CCSM] that submitted a higher-resolution AO model achieved a greater jump in score. For instance, the new GISS-E2-H model is run at half the grid length of GISS Model E, coupled with the Hybrid Coordinate Ocean Model (HYCOM) ocean model (GISS-EH). Note also that the CMIP3 models with flux adjustments have a lower resolution on average, so their scores are better than would be expected, presumably as a result of those constraints.
These results suggest that resolution is an influence on model skill, although for most modeling groups improvements in resolution are made in association with model code development. Indeed, Pope and Stratton (2002) showed that increases in horizontal resolution alone need not have much effect on climatological temperature and rainfall. It is noteworthy that while the top four models from Europe (EU1–EU4) are better resolved than the CMIP5 average, they also perform considerably better than expected from the regression relationship (as can be inferred from Fig. 3). This is consistent with the long history of effective model development in these European groups, and evidently an attention to climatological skills during the course of model development. Both resolution and development were linked to the success of European weather modeling recently by Emanuel (2012; see also Magnusson et al. 2014, and McNally et al. 2014).
A HOME ADVANTAGE?
Can evidence of improved skill in the home continent, where the model was developed, be found in these scores? We have seen that the models from Europe have performed particularly well in this CMIP5 ensemble, and ranked by score for Data ERA an EU model (EU3) performs best over the EU domain (in Table 2). This is displayed in Fig. 2 by the use of stars, instead of circles, to indicate a score from a home model (so this largest EU score is marked by a green star). Further, the four models performing least well for EU are not from EU, which suggests that the EU model developers have avoided low home scores—but the effect is not dominant.
The distribution of stars for the other continents is rather scattered among the points of the same color. Another EU model is the top for AS, while a third is top for NA and AU. Evidently, there is no strong home advantage in the models from those three continents. The CCSM4 model (from NA) is top ranked for SA. The scatter of stars among the values for the CMIP3 case (see the supplement) presents a similar picture.
Since the absolute scores vary between the GL and the continents, these cannot be used directly to demonstrate any home advantage. While limited to discrete steps, the ranks themselves are relevant. Do models tend to be ranked more highly in their home continent than elsewhere? For CMIP5 models, the home and global ranks can be seen directly in Fig. 4. Several models are much better ranked at home than globally (including EU7, NA3, NA6, and AS2). Several fare less well, notably EU1, which benefits in GL from its small Tibetan error. The average of the ranks based on home continent is also plotted. Each is a little better for Home. The same is true of the Data CRU case (not shown). The deviations from the median rank (13 here, and taken as a negative of the rank value) are given in Table 4. There is a home advantage (with the value Home minus GL being positive) in all eight cases, but the largest is merely 2.3 ranks for AU. In fact, this is mostly due to CSIRO Mk3.6, which has the 12th best score for AU (Data CRU) but 19th best for GL (Table 1), and this improvement in rank is largely from its AU precipitation score.
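The rank comparison described above can be sketched in a few lines. This is an illustrative calculation only: the model names and ranks below are hypothetical placeholders, not the values in Table 4, and it simply encodes the sign convention used here (Home minus GL positive means a home advantage, which reduces to GL rank minus Home rank since both deviations are taken from the same median).

```python
"""Hedged sketch of the home-advantage rank comparison.

Ranks are hypothetical placeholders; rank 1 is best. With deviations
defined as (median - rank), Home minus GL reduces to gl_rank - home_rank,
so a positive value means the model ranks better at home.
"""
from statistics import mean

# Hypothetical (home_rank, gl_rank) pairs for models from one continent.
# "ModelC" mimics the CSIRO Mk3.6 case: 12th at home but 19th globally.
ranks = {
    "ModelA": (5, 8),
    "ModelB": (10, 9),
    "ModelC": (12, 19),
}

def home_advantage(home_rank: int, gl_rank: int) -> int:
    """Positive when the model is ranked better (numerically lower) at home."""
    return gl_rank - home_rank

per_model = {m: home_advantage(h, g) for m, (h, g) in ranks.items()}
ensemble_avg = mean(per_model.values())

print(per_model)      # per-model gain or loss in ranks at home
print(ensemble_avg)   # average home advantage over these models
```

Averaging such per-model values over each ensemble, as in the text, yields the "one or two ranks" of home advantage reported.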
An assessment of the errors in the variables at the grid points qualitatively supports a weak favoring of the home continent (see the supplement). It is also evident that the AS models perform well over eastern Asia, where the models were developed. However, errors over Siberia reduce their skill over the AS domain.
The ranks for CMIP3 models (plotted in the supplement) also display considerable differences between the Home and GL ranks. Several NA models ranked much more highly for Home, including the Canadian models [the Coupled Global Climate Model, version 3.1 (CGCM3.1), in its T47 and T63 forms] and NCAR's Parallel Climate Model (PCM). The average advantage, from Table 4, for NA was nearly two ranks for both datasets. The AU average, of the two earlier versions of CSIRO Mk3.6, was as much as six ranks better for Home, using ERA. The EU difference is small and of mixed sign. However, the AS average does not favor Home (as indicated by the use of italics in Table 4), but again those models do better for eastern Asia. Overall then, the home rank tends to be better, but mostly only by one or two ranks.
Based on a nondimensional skill score measuring agreement between seasonal mean fields of three surface variables (tas, pr, and psl), the 25 CMIP5 climate models analyzed simulate the present climate with a range of skill, although naturally even the best falls short of the agreement level between the two observational datasets used. Overall, the CMIP5 ensemble represents a modest improvement in skill over the earlier CMIP3, for global land (GL; excluding Antarctica) and for each of six continents. Using combined scores, the improvement is 31 points (out of 1000), compared to the CMIP5 range of 138 points.
The models submitted by European groups have performed especially well in this CMIP5 ensemble, being highest ranked for the globe and for five continental domains. The NCAR CCSM4 model is the best over South America. The four models that include a HadGEM-based atmosphere, including the new ACCESS model (in two forms) from Australia, also scored particularly well.
While upgrades in model formulation have undoubtedly led to improvements for most groups, the inclusion of Earth system components does not appear to have directly influenced these skill scores. The larger improvements followed from the use of higher resolution, which we quantify here simply using a horizontal grid length. Across the combined ensemble of 49 models, the correlation of skill score with length was r = −0.71. The overall improvement in CMIP5 was consistent, by this relationship, with the 26% reduction in average length. This apparent influence was seen for each continent and each variable. Conversely, the skill of some Earth system models was apparently limited by their modest horizontal resolution.
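The skill-versus-resolution relationship quoted here is an ordinary Pearson correlation across the combined ensemble. The sketch below shows the calculation with synthetic numbers: the grid lengths and scores are invented for illustration and are not the 49-model data behind the reported r = −0.71.

```python
"""Hedged sketch: Pearson correlation of skill score with horizontal grid
length. The data below are synthetic illustrations, chosen only so that
skill falls as grid length rises (hence a strongly negative r)."""
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

grid_len_km = [110, 150, 200, 250, 300]   # hypothetical grid lengths (km)
skill_score = [820, 800, 760, 740, 700]   # hypothetical scores (out of 1000)

r = pearson_r(grid_len_km, skill_score)
print(round(r, 2))  # strongly negative: coarser grids, lower skill
```

With real ensemble data one would substitute each model's grid length and combined skill score; the negative sign of r is what links the CMIP5 resolution increase to its overall skill improvement.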
The models in both ensembles were submitted by groups from four of the six continents. Do the models tend to perform better in their home continent? This has been tested by comparing the Home rank of each model with its GL rank. There are some models where the Home rank is substantially better, and for the Australian CSIRO Mk3.6 model (and its predecessors in CMIP3) this holds, largely because of relatively good simulation of Australian rainfall. However, there are also models that have no home advantage, although any superiority for GL is usually small. Averaging over the home models for each of the four continents, and for the two datasets, there is a consistent improvement for the home rank for CMIP5, but it amounts to only one or two ranks. For CMIP3, the North American models performed notably better over North America.
These results should be reassuring for those wishing to use the CMIP5 ensemble for climate change studies. There is little evidence that models have been developed in a way that favors the simulation of climate in their region of origin, at the possible expense of realism in global processes. On the other hand, studies with a focus on climate impacts in a particular continent may take advantage of the better skill evident in some models, and not necessarily those developed in the region. Rather, horizontal resolution appears to have a consistent relationship with such skills. Further evaluation of models with regard to resolution, particularly for variables known to be sensitive to it, is warranted.
The provision of data by the CMIP5 modeling groups, PCMDI (supported by the U.S. Department of Energy) and NCI, and further analysis by CSIRO team members is gratefully acknowledged. Comments from the reviewers and Prof. Bjorn Stevens were very helpful. The World Climate Research Programme's Working Group on Coupled Modelling is responsible for CMIP. This work has been undertaken as part of the Australian Climate Change Science Program— an Australian government initiative.
A supplement to this article is available online (10.1175/BAMS-D-12-00136.2).