1. Introduction
Most of the scientific basis supporting the attribution of climate change to anthropogenic causes (IPCC 2007, 2013b) is based on atmosphere–ocean general circulation models (GCMs). GCMs are deterministic numerical computational programs, with physical bases that simulate the response of Earth’s climate system (Jun et al. 2008) driven by different forcings (natural, anthropogenic, or combined) on time scales from decades to centuries, without explicitly including meteorological observations. These models are able to closely reproduce many of the physical aspects of the current climate including features of forced and unforced variability (Randall et al. 2007; Gleckler et al. 2008). The Coupled Model Intercomparison Project (CMIP), coordinated by the World Climate Research Program, is the most important international effort conducted to advance climate change modeling and to support the IPCC’s reports. The performance of GCMs that are part of the CMIP5 has improved with respect to those of the CMIP3, increasing the confidence about climate change detection and attribution at the regional and global scales (Bindoff et al. 2013). GCM performance evaluations have been used as an indicator of their reliability for future projections (Jun et al. 2008; Baumberger et al. 2017), to assign weights to individual models in a multimodel ensemble (Tebaldi and Knutti 2007; Knutti et al. 2010; Christensen et al. 2010; Herger et al. 2018), and to select a subset of the “best” models (Perkins et al. 2007; Maxino et al. 2008; Knutti et al. 2010; Sanderson et al. 2015; Herger et al. 2018), which are then used for conducting climate change assessments such as impact, vulnerability, and adaptation studies (IPCC 2014). However, it is also recognized that the usefulness of a GCM cannot be inferred solely from its degree of agreement with observations (Notz 2015).
In principle, GCMs could be considered independent as they have been developed by different groups and with somewhat different modeling strategies and goals. Most studies, including this one, assume explicitly or implicitly that each GCM is independent from the others and as a random drawn from a distribution with the true climate as its mean (Tebaldi and Knutti 2007; Jun et al. 2008). This implies that the average of an ensemble of GCMs should converge to the true climate as more GCMs are included (Jun et al. 2008). In practice, GCMs are hardly independent as some share common genealogy, numerical schemes for solving equations, parameterizations, and/or components (Jun et al. 2008; Christensen et al. 2010; Masson and Knutti 2011; Steinschneider et al. 2015). In consequence, it is not completely justified to assume GCMs as independent in ensembles of opportunity (Masson and Knutti 2011; Knutti et al. 2013). However, the relationship between models and families of models is not clear and the authors believe that there is no objective and convincing way yet to address this problem.
It is generally accepted that the skill of a GCM to simulate the climate for which there are observational instrumental records is a measure of its performance (Jun et al. 2008; Notz 2015; Baumberger et al. 2017). However, simple comparison between simulated and observed climate to assess model skill may not be adequate. The twentieth-century simulations included in the CMIP5 experiment were designed to reproduce the response to observed changes in natural and anthropogenic radiative forcing (i.e., the climate signal), but were not constrained using climate records to replicate relevant aspects of observed variability such as the time of occurrence and phase of natural oscillations. In particular, it is known that low-frequency natural variability can significantly distort the response to radiative forcing (RF), masking it or exaggerating it, depending on the phase of the oscillation (Tsonis et al. 2009; Wu et al. 2011; Estrada et al. 2013b). Initial conditions in GCMs have a much larger effect than previously thought and can significantly distort the warming trend over the observed period and even in future projections (Wallace et al. 2015; Deser et al. 2014). Differences between observed and simulated internal variability are confounding factors that can affect common approaches for evaluating GCM performance (Maraun et al. 2010).
There are a large number of metrics that could be selected to evaluate a wide variety of aspects (Christensen et al. 2010), and new metrics are commonly proposed for particular purposes (Knutti et al. 2010; Herger et al. 2018). However, there is no objective approach for choosing metrics to evaluate GCMs performance and there is little consensus on which metrics are useful to discriminate “good” from “bad” GCMs (Knutti et al. 2010). An a priori subjective selection of a limited set of metrics with largely unknown interdependencies is difficult to avoid (Christensen et al. 2010). Furthermore, objective methods to evaluate GCM performance could be useful to maximize the value of climate change projections (Knutti et al. 2010). However, empirical evidence supporting this statement is, at best, weak and GCMs with good historical performance could underperform when projecting future climate (Weigel et al. 2010; Notz 2015).
The objective of this paper is to present a new methodology to evaluate GCMs’ ability to reproduce the response of the observed mean surface temperature anomaly (MST) to changes in RF at the global and regional scales. Linear least squares regression and formal statistical tests are combined to develop an objective, systematic, and robust method for evaluating GCM performance for climate change applications. The proposed methodology focuses on the central problem of GCM evaluation in the context of climate change: determining the skill of GCMs to simulate the climate system response to changes in external radiative forcings. This objective is achieved using time series models similar to those that have applied for the study of different aspects of climate variability (Tol 1996; Taylor and Buizza 2004), downscaling methods (Estrada et al. 2013a; Estrada and Guerrero 2014), impact assessment (Burke et al. 2015; Hsiang 2016), and climatic change detection and attribution (Tol and de Vos 1993; Harvey and Mills 2002; Qu 2011; Estrada et al. 2013b,c; Estrada and Perron 2014). The proposed methodology underlines the similarities and differences between the evaluation of GCMs and attribution studies, as it is based on evaluating the capacity of models to reproduce observed warming trends.
Section 2 describes the databases of observed and simulated MST used in this study. In section 3 the proposed methodology is presented, and the limitations of classical metrics to determine GCMs performance are discussed. Section 4 shows the results of the proposed methodology applied to the global scale and eight subcontinental land domains distributed across the globe. Conclusions and a summary are given in section 5.
2. Data
a. Observational data and spatial domains
We considered observational data of ocean and land monthly average surface temperature (in K) from two gridded observational datasets: 1) the Hadley Centre Climate Research Unit temperature anomalies (HadCRUT4, version 4.6.0.0) on a 5° × 5° global grid, available for the period 1850–2018 (Morice et al. 2012) and 2) the GISS surface temperature analysis (GISTEMP v4) on a 2° × 2° global grid, available for the period 1850–2005 (GISTEMP Team 2021; Hansen et al. 2010; Lenssen et al. 2019). Temperature over land is measured at stations, whereas temperature over the ocean is derived from sea surface temperature and marine air temperature measurements taken by ships and buoys (Jun et al. 2008). Each of these research centers conduct independent analyses of data quality, inhomogeneities, and corrections of instrumental biases at the grid cell level.
To illustrate the proposed methodology, our analysis focuses on the global scale and on eight subcontinental land regions (Fig. 1) that are characterized by different climatic regimes and for which GCM performance has been evaluated in previous research (IPCC 2013a; Qian and Zhang 2015; Chan and Wu 2015). The selected regions are the United States (USA), western Europe (EuW), northern Europe (EuN), Mexico (Mex), and China (Chi) in the Northern Hemisphere (NH) and the Amazon (Ama), southern Africa (SAf), and Australia (Aus) in the Southern Hemisphere (SH). Ocean regions were not included because data tend to be sparser.
The period 1910–2005 was chosen for this study since these are the years for which the HadCRUT4 and GISTEMP datasets have no missing data in more than 70% of the grid points over most of the selected domains. Data completeness is similar for both HadCRUT4 and GISTEMP datasets for the domains located in the NH, with the exception of China. Data gaps are larger in the SH and this is more evident for GISTEMP over regions located in the Amazon and southern Africa than for HadCRUT4 due to differences in data processing mentioned above. The spatial average and time series anomalies of annual MST were calculated with respect to the reference period 1961–90 (Fig. 2).
The observed global and regional MST are influenced by atmospheric and oceanic natural climate variability that can distort the underlying response of the MST to changes in RF. To account for their confounding effects, natural variability modes are considered in our analysis. Modes like the Atlantic multidecadal oscillation (AMO), the Pacific decadal oscillation (PDO), the Southern Oscillation index (SOI), the northern annular mode (NAM), and the North Atlantic Oscillation (NAO) have a stronger influence in the NH regions (Hu et al. 2003; Englehart and Douglas 2004; Brönnimann et al. 2007; Riaz et al. 2017; Brunetti and Kutiel 2011; de Beurs et al. 2018; Dong et al. 2019). The dipole mode index or Indian Ocean dipole (IOD), southern annular mode (SAM), North Pacific index (NPI), AMO, and SOI influence SH regions (Mason and Jury 1997; Power et al. 1999; Tyson and Preston-Whyte 2000; Hendon et al. 2007; Fogt et al. 2011; Ashcroft et al. 2014; Lakhraj-Govender and Grab 2018). Annual time series of these natural variability indices were obtained from the following sources: AMO (NOAA; Enfield et al. 2001), PDO (JISAO; Mantua et al. 1997), SOI (CRU; Ropelewski and Jones 1987), NAO (CRU; Jones et al. 1997), NAM (NCAR 2019), NPI (Hurrell et al. 2019), IOD (NOAA; Saji and Yamagata 2003), and SAM (MESNZ 2017).
Observed radiative forcing time series were obtained from GISS-NASA (Hansen et al. 2011), commonly used in the literature in the estimation of the transient climate response (Gregory and Forster 2008; Schwartz 2012) and attribution studies (Kaufmann et al. 2011; Estrada et al. 2013b; Pasini et al. 2017; Estrada and Perron 2019). The radiative forcing from well-mixed greenhouse gases, land use change, ozone, stratospheric H2O, aerosols, black carbon, solar irradiance, and snow albedo are aggregated into total radiative forcing (TRF), which summarizes all variables that have a trending behavior (Estrada et al. 2013b). The radiative forcing from stratospheric aerosols (VOLC) is also considered to account for the effects of volcanic eruptions.
b. GCM output
We use 107 realizations of 2-m surface temperature from 21 GCMs included in the CMIP5 historical experiment. We selected GCMs that had at least two realizations (Table 1; Taylor et al. 2012). The multimodel mean, the 21 ensemble means from each GCM, and the 107 individual model runs were included in the analysis. The sample was chosen to match that of the observations (1910–2005). All simulations are available in standard NetCDF format at https://esgf-node.llnl.gov/search/cmip5/ and at https://cera-www.dkrz.de/WDCC/ui/cerasearch/. The spatial resolution of model output varies across GCMs and thus simulations were regridded to two common grids that correspond to those of the HadCRUT4 and GISTEMP datasets using bilinear interpolation (Jun et al. 2008). Annual MST anomalies with respect to 1961–90 were obtained using monthly frequency data and the spatial average was calculated for each GCM and region (Fig. 2).
GCMs included in the CMIP5 historical experiments that had at least two realizations. Resolutions are provided at https://portal.enes.org/data/enes-model-data/cmip5/resolution.
Binary masks of monthly missing/available grid points obtained from each observational dataset (HadCRUT4 and GISTEMP) were applied to GCMs output in order to mimic observed and simulated data. This allows the assessment of GCMs simulations where and when observations are available and thus reduces biases introduced by observational coverage as much as possible (Hegerl et al. 2007; Knutson et al. 2013; Cowtan et al. 2015).
All GCMs included in the CMIP5 produce numerical experiments that are dependent on a set of initial conditions and external forcing scenarios to simulate, for example, past, present, and future climates. The ensemble mean was calculated for each GCM to 1) produce a clearer climate signal since averaging over realizations dampens variability and provides a better representation of the model’s response to changes in radiative forcing (Jun et al. 2008; Knutti et al. 2010; Annan and Hargreaves 2011), and 2) reduce the variability in simulations that would otherwise contribute to the error component in any statistical model (Jun et al. 2008; Deser et al. 2014). For calculating the ensemble mean each realization is equally weighted, which can be considered a more transparent strategy to combine GCM outputs (Weigel et al. 2010; Herger et al. 2018), since the only difference between simulations from the same GCM using the same external forcing and physics configuration is the sets of initial conditions. These initial conditions are, for any practical reason, to be considered random (Maraun et al. 2010) and as such there is no reason to assign lower or higher weights to any particular run.
3. Methodology
This section is composed of two parts. First, we analyze some of the metrics that are commonly used to evaluate GCM performance, and we show that such metrics are not helpful to discriminate GCMs based on their skill to reproduce observed climate and that they may not be informative for climate change applications. The results motivate the need to develop new metrics that are more robust and adequate for climate change applications. The second part of this section proposes a new methodology to evaluate GCM performance that tackles weaknesses identified in classical metrics and that focuses on the ability of GCMs to reproduce the observed response to changes in external radiative forcing.
a. Assessment of classical metrics for evaluating GCM performance
The process of developing, evaluating, and combining GCM performance metrics is not straightforward. Rather, there is a considerable amount of subjectivity in the selection of metrics and in their interpretation (Christensen et al. 2010). Some of the metrics that have been proposed to evaluate GCMs performance include the magnitude of model biases during the observed period, comparison of trend slope sign and magnitudes, or composites of a large number of model performance diagnostics (Weigel et al. 2010).
To illustrate some of the limitations of these commonly applied metrics, we evaluate the performance of GCMs in reproducing the observed annual MST time series from HadCRUT4 and GISTEMP for the global and USA domains. The metrics chosen for this illustration are the Pearson linear correlation coefficient and the root-mean-square error (RMSE). These metrics were calculated for the annual mean MST series from 107 individual realizations of 21 GCMs and for the ensemble mean (rE) of each GCM. As is common in the evaluation of climate models for climate change applications, the trend was not removed because it is the main component of interest and the association between observed and simulated variations around the trend is expected to be close to zero because of the CMIP5 twentieth-century experiment characteristics, which focus on reproducing the response of the climate system to changes in observed radiative forcing; these are “free-running simulations” (i.e., with no nudging or data assimilation) and thus internal model variability and observations have no reason to be related (Maraun et al. 2010; Estrada et al. 2012; Deser et al. 2014; Sun et al. 2019).
Figure 3 shows the calculated correlation and RMSE value for all simulations and ensemble means, in decreasing order from highest to lowest correlation values. This figure reveals that 1) these metrics are hardly independent as higher correlation values are associated with lower RMSE values (this is of importance because the lack of independence between metrics is seldom accounted for in practice and can generate reinforcement biases) and 2) ensemble means tend to show higher correlation values than any individual realization; as discussed in the following paragraph, confidence intervals for the correlation coefficient tend to be smaller for models with higher correlation values (i.e., ensemble means).
While the point estimates of these metrics may clearly suggest that some GCMs have better performance than others, the uncertainty in these estimates needs to be accounted for to infer how different these values really are. This is not commonly done when classical metrics are applied, but it should be a standard practice, as it is in other fields. We constructed 95% confidence intervals for the estimated correlation coefficients and the RMSE values by calculating the standard error using the bootstrap method, which is based on resampling with replacement to approximate the empirical distribution of sample estimates (Efron and Tibshirani 1998).
The results show that most of the confidence intervals calculated for the correlation coefficients and RMSE values overlap. Although we show results for only two domains, similar results are found for all other domains. This illustrates that these metrics do not provide strong statistical evidence for supporting the use of one GCM over another, for ranking models, or for assigning different weights. Based on these metrics, GCMs show a similar skill for simulating the observed MST annual time series. Note that ignoring the confidence intervals, these metrics could lead to very different conclusions regarding model ranking and selection of the “best” GCMs and weights, depending on the particular realization (r1, r2, …) that is chosen for each model. It is important to remember that realizations from the same model differ only in the initial conditions, and that those sets of values are chosen randomly. As such, using these conventional metrics, without accounting for their uncertainty, could be as effective as randomly choosing a set of models, ranking them, or assigning weights. This example illustrates the lack of robustness of conventional metrics to discern the differences in performance of a set of GCMs and it shows that results are sensitive not only to the selected metric but to the particular realizations that are chosen.
Furthermore, these commonly used metrics offer no information about how well the GCMs can reproduce the climate system response to changes in external radiative forcing. These metrics, like many others, are not meaningful to evaluate the change in climate, as their objective is to compare climatological (static) states, not how they evolve. In addition, most of the classical metrics do not consider the effects of factors such as internal variability and differences in initial states. In particular, low-frequency oscillations can considerably distort trends in climatic variables, either in observations or in GCM simulations (Swanson et al. 2009; Wu et al. 2011), and initial conditions can have similar effects on simulations (Wallace et al. 2015; Deser et al. 2014). Below, we present a methodology based on ordinary least squares regression that focuses on assessing GCM performance in reproducing the response to RF embedded in the observations.
b. Performance evaluation based on regression models
Another potential source of nonstationarities is υt, which contains a variety of oscillations with different frequencies. It has been shown that low-frequency natural variability can distort the trend of climatic variables (Swanson et al. 2009; Wu et al. 2011). Large differences in phase and amplitude of low-frequency oscillations between observed and simulated variability can produce nonstationarities in the residuals of the regression.
Equation (6) can be expressed as
The evaluation of the performance of a particular GCM is determined by analyzing the regression’s coefficients and residuals: an accurate representation of the observed response to RF requires the coefficients
The proposed regression approach has important advantages over other methods found in the literature. Among these are 1) the statistical significance of the bias parameter
The proposed methodology is implemented in two steps. First, an auxiliary regression model based on Eq. (6),
It is important to note that the estimation of coefficients is independent in these two steps, in contrast to other estimation approaches such as two-stage least squares in which the coefficients estimated in the first stage are used in the second stage estimation. Once regression (6) has been estimated, the performance of the GCM is determined by evaluating two things. The first is the similarity of the trend in observed and simulated MST time series by testing the stability of parameter
4. An analysis of GCM performance over global and regional domains
In this section we present an application of the proposed methodology for the global domain and eight subcontinental land regions (Fig. 1). We consider the annual mean surface temperature time series from HadCRUT4, GISTEMP, and the ensembles of simulations produced by 21 GCMs, as well as the multimodel ensemble (see sections 2a and 2b). The regression models include two groups of independent variables: 1) those that approximate the warming trend, namely TRF and the ensemble mean of each GCM, and 2) the set of variables Xt,j that include the main natural variability modes (AMO, PDO, SOI, NAM, NAO, NPI, IOD, and SAM; see section 2a), stratospheric aerosols (VOLC), and the persistence of MST,
AMO has a significant influence at the global scale and over most of the domains located in the Northern Hemisphere (Bindoff et al. 2013; Steinman et al. 2015; Guan et al. 2015). In such regions, the positive phase of AMO is associated with higher temperatures and its influence is largest over Europe and North America (Fig. 4b). AMO is characterized by a low-frequency oscillation that has been shown to obscure the warming trend by masking or exacerbating it depending on its phase (Swanson et al. 2009; Wu et al. 2011). PDO and NAO have also been proposed as variability modes that can distort the global warming trend and, as expected, the regression models for the global domain include the AMO, PDO, and SOI, which also have a global effect on temperatures (Guan et al. 2015; Li et al. 2013b; Cohen and Barlow 2005). Figure 4b provides empirical evidence of the effect of observed natural variability modes on MST, the estimated models show that AMO has a significant influence in most of the domains (7 out of 9), followed by SOI, NAO, and PDO. The remaining variability modes have influence over particular regional domains. As shown in the literature, NAO and NAM show relevant effects over regions such as Europe and Mexico (Fig. 4b; Li et al. 2013a; Vihma et al. 2019), while SAM and IOD mainly influence regions in the Southern Hemisphere (Wang and Cai 2013).
Figure 5 shows the range of values of the estimated regression parameters for the nine domains, and for the HadCRUT4 and GISTEMP datasets. This figure includes the results for the mean of 21 GCM ensembles, as well as for the multimodel mean. The right panel of Fig. 5 shows only the parameter values of regressions for which parameter
If no structural break is present and the functional form is correct, then we test for under/overestimation of the observed warming rate using Wald tests. In the case of regressions that include a number p of lagged terms of the dependent variable to correct for autocorrelation, the
Figure 6 compares the 21 GCM ensemble means (gray lines) and the fitted regression models that satisfied the parameter stability and linearity assumptions (blue lines) as well as those for which
Figure 7 illustrates the cases in which 1) regression models satisfy the parameter stability and linearity assumptions and
Considering the HadCRUT4 (GISTEMP) dataset, only 12 (8) of the 21 GCMs plus the multimodel ensemble mean, produce regressions in which parameter
It is worth noting that these results depend on the temperature database used, as also happens with traditional metrics. The differences in results between datasets are more common in regions where there are more data gaps. Differences in spatial coverage and temporal continuity, as well as in data and gap-filling processes, can generate disparities in the warming trends contained in each dataset. For instance, GISTEMP tends to show higher warming in most domains during the second part of the twentieth century when compared to HadCRUT4 (Fig. 2). These differences are larger in regions located in the Southern Hemisphere where data coverage is sparser, and smaller in regions with fewer data gaps such as the United States and Europe. Differences in data coverage and quality likely influence results shown in Fig. 8.
The ability of GCM simulations to reproduce the magnitude of the observed warming also varies between regions and depends on the observational database that is used, the ability of current GCMs to adequately simulate the spatial distribution of warming, and factors related to RF. For the majority of GCMs (>50%) that are not able to reproduce the warming trend (i.e., parameter stability is not satisfied; see Fig. 8) this problem occurs in the following domains: EuN and EuW (GISTEMP) and Ama, Chi, EuN, and EuW (HadCRUT4). In such domains, GCMs tend to simulate higher rates of warming than observed.
The lack of agreement between observed and modeled warming rates has been discussed in the literature, and three main hypotheses can be identified: low-frequency natural variability and feedback processes, unaccounted external radiative forcing factors or changes in their rate of growth, and deficiencies in temperature datasets (see Estrada and Perron 2017). However, the observed slowdown was most likely caused by a combination of multiple factors and cannot be attributed to any particular one. For some of the GCMs, the overestimation of the warming trend starts in the 1970s, but this becomes more pronounced from the mid-1990s onward, as documented by the results of the structural change test that we applied (Fig. 8). This finding is in agreement with what is reported by Fyfe et al. (2013), who conclude that GCMs from CMIP5, with the prescribed forcings, do not reproduce the slowdown from 1998 to 2012. Moreover, most of these GCMs tend to overestimate warming trends during recent decades compared with observations (Kim et al. 2012). This overestimation of the warming trend could be related to the limited ability of GCMs to adequately simulate regional feedback processes such as the Arctic amplification, which became more pronounced since the 1990s, and a variety of local and remote feedback processes related to it (Gillett et al. 2008; Cohen et al. 2019). The existence of unaccounted RF factors or important changes in their rate of growth has also been proposed as a possible explanation for the observed slowdown in the warming trend during the late twentieth century (Estrada et al. 2013b; Steinman et al. 2015).
Eleven out of 22 GCMs are able to reproduce both the trend of the observed response to RF and the magnitude of the warming rate for at least 30% of the selected domains. In the case of the HadCRUT4 dataset, these models are CNRM-CM5, HadCM3, CSIRO-Mk3.6.0, CSIRO-Mk3L-1, GISS-E2-H, MIROC5, and the multimodel mean, while for the GISTEMP dataset these are ACCESS1.0, ACCESS1.3, NorESM1-M, MIROC-ESM, and multimodel mean. The domains for which at least 40% of the GCMs are able to reproduce both the trend of the response to RF and the warming rate of the observational datasets are Gbl, Aus, SAf, and Ama, while for the domains in the Northern Hemisphere most of the GCMs tend to significantly overestimate the warming rate (
5. Conclusions
In this paper we show that GCM evaluation, selection, and ranking based on classical performance metrics can be misleading and the differences between these metrics can be random and meaningless. This is clearly shown when comparing the metrics’ confidence intervals instead of just point estimates. Compared to commonly used metrics, the proposed methodology introduces relevant improvements. It allows us to formally evaluate two of the most relevant aspects for climate change projections: 1) if the trend of the response to RF from a particular GCM is compatible with observations and 2) if the magnitude of the response to RF is similar to that in observations. The proposed method allows us to evaluate the performance of GCMs in reproducing the observed warming trend in a multivariate setting in which the effects of natural variability are accounted for.
The methodology allows us to formally test for these two characteristics and to evaluate the statistical significance of differences between observations and GCMs, as well as between different GCMs. This new approach is based on formal statistical tests that provide empirical evidence that allows us to classify GCMs into groups that 1) are able or unable to adequately reproduce the observed warming trend and 2) under/overestimate or accurately estimate the magnitude of the response to RF. These tests can also be applied to jointly evaluate metrics. Furthermore, confounding factors that may distort the response to RF, such as natural observed variability and the internal GCMs variability, are controlled for. These improvements in GCM evaluation can be of particular importance for applications such as impact, vulnerability, and adaptation assessments and for detecting areas of opportunity to improve current GCMs.
We apply the proposed methodology to nine spatial domains and show that from the GCMs considered (21 models plus the multimodel mean) in this study, only 40% of them are able to reproduce the observed warming in the Gbl, Aus, SAf, and Ama regions, in terms of both its trend and rate of increase. Less than 40% of them are able to reproduce the trend and magnitude of warming in Chi, EuN, EuW, Mex, and USA regions, and most GCMs overestimate the warming rate. While most of the classical performance metrics only provide relative measures of how well GCMs are able to reproduce the observed climate, the proposed method is based on stricter and more informative criteria to discriminate models that can and cannot reproduce two of the most relevant aspects of performance for climate change studies. The proposed method indicates which of these criteria GCMs fail to satisfy. Moreover, the proposed methodology allows us to identify that most of the GCMs tend to overestimate the warming in regions of the Northern Hemisphere, and that these models’ simulations tend to show significant discrepancies with the observed magnitude of the warming trend, particularly since the mid-1990s. Several explanations for the reduced warming rate during that period have been proposed (Estrada et al. 2013b; Guan et al. 2015; Steinman et al. 2015; Fyfe et al. 2016; Estrada and Perron 2017), and the lack of fit of models during this period has been discussed in the literature (Dai et al. 2015; Fyfe et al. 2016).
Acknowledgments
This study was developed as part of a PhD project in the Postgraduate Program in Earth Sciences of the National Autonomous University of Mexico, with a CONACYT scholarship. The authors are grateful to René Lobato Sánchez, Víctor Manuel Mendoza Castro, and Ignacio Arturo Quintanar Isaias for their helpful feedback and recommendations.
Data availability statement
The data that support the findings of this study are available from the corresponding author upon request.
REFERENCES
Andrews, D., 1993: Tests for parameter instability and structural change with unknown change point. Econometrica, 61, 821–856, https://doi.org/10.2307/2951764.
Annan, J., and J. Hargreaves, 2011: Understanding the CMIP3 multimodel ensemble. J. Climate, 24, 4529–4538, https://doi.org/10.1175/2011JCLI3873.1.
Ashcroft, L., D. Karoly, and J. Gergis, 2014: Southeastern Australian climate variability 1860–2009: A multivariate analysis. Int. J. Climatol., 34, 1928–1944, https://doi.org/10.1002/joc.3812.
Baumberger, C., R. Knutti, and G. Hirsch-Hadorn, 2017: Building confidence in climate model projections: An analysis of inferences from fit. Wiley Interdiscip. Rev.: Climate Change, 8, e454, https://doi.org/10.1002/wcc.454.
Bindoff, N., and Coauthors, 2013: Detection and attribution of climate change: From global to regional. Climate Change 2013: The Physical Science Basis, T. F. Stocker et al., Eds., Cambridge University Press, 867–952.
Brönnimann, S., E. Xoplaki, C. Casty, A. Pauling, and J. Luterbacher, 2007: ENSO influence on Europe during the last centuries. Climate Dyn., 28, 181–197, https://doi.org/10.1007/s00382-006-0175-z.
Brunetti, M., and H. Kutiel, 2011: The relevance of the North-Sea Caspian Pattern (NCP) in explaining temperature variability in Europe and the Mediterranean. Nat. Hazards Earth Syst. Sci., 11, 2881–2888, https://doi.org/10.5194/nhess-11-2881-2011.
Burke, M., S. Hsiang, and E. Miguel, 2015: Global non-linear effect of temperature on economic production. Nature, 527, 235–239, https://doi.org/10.1038/nature15725.
Chan, D., and Q. Wu, 2015: Attributing observed SST trends and subcontinental land warming to anthropogenic forcing during 1979–2005. J. Climate, 28, 3152–3170, https://doi.org/10.1175/JCLI-D-14-00253.1.
Christensen, J., E. Kjellström, F. Giorgi, G. Lenderink, and M. Rummukainen, 2010: Weight assignment in regional climate models. Climate Res., 44, 179–194, https://doi.org/10.3354/cr00916.
Cohen, J., and M. Barlow, 2005: The NAO, the AO, and global warming: How closely related? J. Climate, 18, 4498–4513, https://doi.org/10.1175/JCLI3530.1.
Cohen, J., and Coauthors, 2019: Divergent consensuses on Arctic amplification influence on midlatitude severe winter weather. Nat. Climate Change, 10, 20–29, https://doi.org/10.1038/s41558-019-0662-y.
Cowtan, K., and Coauthors, 2015: Robust comparison of climate models with observations using blended land air and ocean sea surface temperatures. Geophys. Res. Lett., 42, 6526–6534, https://doi.org/10.1002/2015GL064888.
Dai, A., J. Fyfe, S. Xie, and X. Dai, 2015: Decadal modulation of global surface temperature by internal climate variability. Nat. Climate Change, 5, 555–559, https://doi.org/10.1038/nclimate2605.
de Beurs, K., G. Henebry, B. Owsley, and I. Sokolik, 2018: Large scale climate oscillation impacts on temperature, precipitation and land surface phenology in Central Asia. Environ. Res. Lett., 13, 065018, https://doi.org/10.1088/1748-9326/aac4d0.
Deser, C., A. S. Phillips, M. A. Alexander, and B. V. Smoliak, 2014: Projecting North American climate over the next 50 years: Uncertainty due to internal variability. J. Climate, 27, 2271–2296, https://doi.org/10.1175/JCLI-D-13-00451.1.
Dong, X., S. Zhang, J. Zhou, J. Cao, L. Jiao, Z. Zhang, and Y. Liu, 2019: Magnitude and frequency of temperature and precipitation extremes and the associated atmospheric circulation patterns in the Yellow River basin (1960–2017), China. Water, 11, 2334, https://doi.org/10.3390/w11112334.
Efron, B., and R. Tibshirani, 1998: An Introduction to the Bootstrap. Chapman and Hall/CRC, 436 pp.
Enfield, D., A. Mestas-Nuñez, and P. Trimble, 2001: The Atlantic Multidecadal Oscillation and its relationship to rainfall and river flows in the continental U.S. Geophys. Res. Lett., 28, 2077–2080, https://doi.org/10.1029/2000GL012745.
Englehart, P. J., and A. V. Douglas, 2004: Characterizing regional-scale variations in monthly and seasonal surface air temperature over Mexico. Int. J. Climatol., 24, 1897–1909, https://doi.org/10.1002/joc.1117.
Estrada, F., and V. Guerrero, 2014: A new methodology for building local climate change scenarios: A case study of monthly temperature projections for Mexico City. Atmósfera, 27, 429–449, https://doi.org/10.20937/ATM.2014.27.04.08.
Estrada, F., and P. Perron, 2014: Detection and attribution of climate change through econometric methods. Bol. Soc. Mat. Mex., 20, 107–136, https://doi.org/10.1007/s40590-014-0009-7.
Estrada, F., and P. Perron, 2017: Extracting and analyzing the warming trend in global and hemispheric temperatures. J. Time Ser. Anal., 38, 711–732, https://doi.org/10.1111/jtsa.12246.
Estrada, F., and P. Perron, 2019: Causality from long-lived radiative forcings to the climate trend. Ann. N. Y. Acad. Sci., 1436, 195–205, https://doi.org/10.1111/nyas.13923.
Estrada, F., B. Martínez-López, C. Conde, and C. Gay-García, 2012: The new national climate change documents of Mexico: What do the regional climate change scenarios represent? Climatic Change, 110, 1029–1046, https://doi.org/10.1007/s10584-011-0100-2.
Estrada, F., V. Guerrero, and C. Gay-García, 2013a: A cautionary note on automated statistical downscaling methods for climate change. Climatic Change, 120, 263–276, https://doi.org/10.1007/s10584-013-0791-7.
Estrada, F., P. Perron, and B. Martínez-López, 2013b: Statistically derived contributions of diverse human influences to twentieth-century temperature changes. Nat. Geosci., 6, 1050–1055, https://doi.org/10.1038/ngeo1999.
Estrada, F., P. Perron, C. Gay-García, and B. Martínez-López, 2013c: A time-series analysis of the 20th century climate simulations produced for the IPCC’s fourth assessment report. PLOS ONE, 8, 1–10, https://doi.org/10.1371/journal.pone.0060017.
Fogt, R., D. Bromwich, and K. Hines, 2011: Understanding the SAM influences on the South Pacific–ENSO teleconnection. Climate Dyn., 36, 1555–1576, https://doi.org/10.1007/s00382-010-0905-0.
Fyfe, J., P. Gillett, and F. Zwiers, 2013: Overestimated global warming over the past 20 years. Nat. Climate Change, 3, 767–769, https://doi.org/10.1038/nclimate1972.
Fyfe, J., and Coauthors, 2016: Making sense of the early-2000s warming slowdown. Nat. Climate Change, 6, 224–228, https://doi.org/10.1038/nclimate2938.
Gillett, N., D. Stone, P. Stott, T. Nozawa, A. Karpechko, G. Hegerl, M. Wehner, and P. Jones, 2008: Attribution of polar warming to human influence. Nat. Geosci., 1, 750–754, https://doi.org/10.1038/ngeo338.
GISTEMP Team, 2021: GISS surface temperature analysis (GISTEMP), version 4. NASA Goddard Institute for Space Studies, accessed 28 January 2019, https://data.giss.nasa.gov/gistemp/.
Glahn, H., and D. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Gleckler, P., K. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. J. Geophys. Res., 113, D06104, https://doi.org/10.1029/2007JD008972.
Greene, W., 2012: Econometric Analysis. 7th ed. Prentice Hall, 1239 pp.
Gregory, J., and P. Forster, 2008: Transient climate response estimated from radiative forcing and observed temperature change. J. Geophys. Res., 113, D23105, https://doi.org/10.1029/2008JD010405.
Guan, X., J. Huang, R. Guo, and P. Lin, 2015: The role of dynamically induced variability in the recent warming trend slowdown over the Northern Hemisphere. Sci. Rep., 5, 12669, https://doi.org/10.1038/srep12669.
Hansen, J., R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change. Rev. Geophys., 48, RG4004, https://doi.org/10.1029/2010RG000345.
Hansen, J., M. Sato, P. Kharecha, and K. V. Schuckmann, 2011: Earth’s energy imbalance and implications. Atmos. Chem. Phys., 11, 13 421–13 449, https://doi.org/10.5194/acp-11-13421-2011.
Harvey, D., and T. Mills, 2002: Unit roots and double smooth transitions. J. Appl. Stat., 29, 675–683, https://doi.org/10.1080/02664760120098739.
Hegerl, G., and F. Zwiers, 2011: Use of models in detection and attribution of climate change. Wiley Interdiscip. Rev.: Climate Change, 2, 570–591, https://doi.org/10.1002/wcc.121.
Hegerl, G., and Coauthors, 2007: Understanding and attributing climate change. Climate Change 2007: The Physical Science Basis, S. Solomon et al., Eds., Cambridge University Press, 663–745.
Hendon, H., D. W. J. Thompson, and M. C. Wheeler, 2007: Australian rainfall and surface temperature variations associated with the Southern Hemisphere annular mode. J. Climate, 20, 2452–2467, https://doi.org/10.1175/JCLI4134.1.
Herger, N., G. Abramowitz, R. Knutti, O. Angélil, K. Lehmann, and B. M. Sanderson, 2018: Selecting a climate model subset to optimise key ensemble properties. Earth System Dyn., 9, 135–151, https://doi.org/10.5194/esd-9-135-2018.
Hsiang, S., 2016: Climate econometrics. Annu. Rev. Resour. Econ., 8, 43–75, https://doi.org/10.1146/annurev-resource-100815-095343.
Hu, Z., S. Yang, and R. Wu, 2003: Long-term climate variations in China and global warming signals. J. Geophys. Res., 108, 4614, https://doi.org/10.1029/2003JD003651.
Hurrell, J., and Coauthors, 2019: The climate data guide: North Pacific (NP) Index by Trenberth and Hurrell; monthly and winter. National Center for Atmospheric Research, accessed 11 February 2020, https://climatedataguide.ucar.edu/climate-data/north-pacific-np-index-trenberth-and-hurrell-monthly-and-winter.
IPCC, 2007: Climate Change 2007: The Physical Science Basis. Cambridge University Press, 996 pp.
IPCC, 2013a: Climate Change 2013: The Physical Science Basis. Cambridge University Press, 1535 pp.
IPCC, 2013b: Summary for policymakers. Climate Change 2013: The Physical Science Basis. Cambridge University Press, 29 pp.
IPCC, 2014: Summary for policymakers. Climate Change 2014: Impacts, Adaptation and Vulnerability. Cambridge University Press, 32 pp.
Jones, P., T. Jónsson, and D. Wheeler, 1997: Extension to the North Atlantic Oscillation using early instrumental pressure observations from Gibraltar and South-West Iceland. Int. J. Climatol., 17, 1433–1450, https://doi.org/10.1002/(SICI)1097-0088(19971115)17:13<1433::AID-JOC203>3.0.CO;2-P.
Jun, M., R. Knutti, and D. Nychka, 2008: Spatial analysis to quantify numerical model bias and dependence: How many climate models are there? J. Amer. Stat. Assoc., 103, 934–947, https://doi.org/10.1198/016214507000001265.
Kaufmann, R., H. Kauppi, M. Mann, and J. Stock, 2011: Reconciling anthropogenic climate change with observed temperature 1998–2008. Proc. Natl. Acad. Sci. USA, 108, 11 790–11 793, https://doi.org/10.1073/pnas.1102467108.
Keele, L., and N. Kelly, 2006: Dynamic models for dynamic theories: The ins and outs of lagged dependent variables. Polit. Anal., 14, 186–205, https://doi.org/10.1093/pan/mpj006.
Kim, H.-M., P. Webster, and J. Curry, 2012: Evaluation of short-term climate change prediction in multi-model CMIP5 decadal hindcasts. Geophys. Res. Lett., 39, L10701, https://doi.org/10.1029/2012GL051644.
Knutson, T., F. Zeng, and A. Wittenberg, 2013: Multimodel assessment of regional surface temperature trends: CMIP3 and CMIP5 twentieth-century simulations. J. Climate, 26, 8709–8743, https://doi.org/10.1175/JCLI-D-12-00567.1.
Knutti, R., R. Furrer, C. Tebaldi, J. Cermak, and G. Meehl, 2010: Challenges in combining projections from multiple climate models. J. Climate, 23, 2739–2758, https://doi.org/10.1175/2009JCLI3361.1.
Knutti, R., D. Masson, and A. Gettelman, 2013: Climate model genealogy: Generation CMIP5 and how we got there. Geophys. Res. Lett., 40, 1194–1199, https://doi.org/10.1002/grl.50256.
Lakhraj-Govender, R., and S. Grab, 2018: Assessing the impact of El Niño–Southern Oscillation on South African temperatures during austral summer. Int. J. Climatol., 39, 143–156, https://doi.org/10.1002/joc.5791.
Lenssen, N., G. Schmidt, J. Hansen, M. Menne, A. Persin, R. Ruedy, and D. Zyss, 2019: Improvements in the GISTEMP uncertainty model. J. Geophys. Res. Atmos., 124, 6307–6326, https://doi.org/10.1029/2018JD029522.
Li, J., C. Sun, and F. Jin, 2013a: NAO implicated as a predictor of Northern Hemisphere mean temperature multidecadal variability. Geophys. Res. Lett., 40, 5497–5502, https://doi.org/10.1002/2013GL057877.
Li, J., and Coauthors, 2013b: El Niño modulations over the past seven centuries. Nat. Climate Change, 3, 822–826, https://doi.org/10.1038/nclimate1936.
Mantua, N., S. Hare, Y. Zhang, J. Wallace, and R. Francis, 1997: Pacific interdecadal climate oscillation with impacts on salmon production. Bull. Amer. Meteor. Soc., 78, 1069–1079, https://doi.org/10.1175/1520-0477(1997)078<1069:APICOW>2.0.CO;2.
Maraun, D., and Coauthors, 2010: Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user. Rev. Geophys., 48, RG3003, https://doi.org/10.1029/2009RG000314.
Mason, S., and M. Jury, 1997: Climatic variability and change over southern Africa: A reflection on underlying processes. Prog. Phys. Geogr., 21, 23–50, https://doi.org/10.1177/030913339702100103.
Masson, D., and R. Knutti, 2011: Climate model genealogy. Geophys. Res. Lett., 38, L08703, https://doi.org/10.1029/2011GL046864.
Maxino, C., B. McAvaney, A. Pitman, and S. Perkins, 2008: Ranking the AR4 climate models over the Murray-Darling basin using simulated maximum temperature, minimum temperature and precipitation. Int. J. Climatol., 28, 1097–1112, https://doi.org/10.1002/joc.1612.
MESNZ, 2017: Southern annular mode annual values, 1887–2016. Ministry for the Environment and Statistics New Zealand, accessed 11 February 2020, https://data.mfe.govt.nz/table/89383-southern-annular-mode-annual-values-18872016/metadata/.
Miller, R., and Coauthors, 2014: CMIP5 historical simulations (1850–2012) with GISS ModelE2. J. Adv. Model. Earth Syst., 6, 441–478, https://doi.org/10.1002/2013MS000266.
Morice, C., J. Kennedy, N. Rayner, and P. Jones, 2012: Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset. J. Geophys. Res., 117, D08101, https://doi.org/10.1029/2011JD017187.
NCAR, 2019: The climate data guide: Hurrell wintertime SLP-based Northern Annular Mode (NAM) Index. National Center for Atmospheric Research, accessed 28 January 2019, https://climatedataguide.ucar.edu/climate-data/hurrell-wintertime-slp-based-northern-annular-mode-nam-index.
Notz, D., 2015: How well must climate models agree with observations? Philos. Trans. Roy. Soc., A373, 20140164, https://doi.org/10.1098/rsta.2014.0164.
Pasini, A., P. Racca, S. Amendola, G. Cartocci, and C. Cassardo, 2017: Attribution of recent temperature behaviour reassessed by a neural network method. Sci. Rep., 7, 17681, https://doi.org/10.1038/s41598-017-18011-8.
Perkins, S., A. Pitman, N. Holbrook, and J. McAneney, 2007: Evaluation of the AR4 climate models’ simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions. J. Climate, 20, 4356–4376, https://doi.org/10.1175/JCLI4253.1.
Power, S., T. Casey, C. Folland, A. Colman, and V. Mehta, 1999: Inter-decadal modulation of the impact of ENSO on Australia. Climate Dyn., 15, 319–324, https://doi.org/10.1007/s003820050284.
Qian, C., and X. Zhang, 2015: Human influences on changes in the temperature seasonality in mid- to high-latitude land areas. J. Climate, 28, 5908–5921, https://doi.org/10.1175/JCLI-D-14-00821.1.
Qu, Z., 2011: A test against spurious long memory. J. Bus. Econ. Stat., 29, 423–438, https://doi.org/10.1198/jbes.2010.09153.
Ramsey, J., 1969: Tests for specification errors in classical linear least squares regression analysis. J. Roy. Stat. Soc., 31, 350–371, https://www.jstor.org/stable/2984219.
Randall, D., and Coauthors, 2007: Climate models and their evaluation. Climate Change 2007: The Physical Science Basis, S. Solomon et al., Eds., Cambridge University Press, 589–662.
Riaz, S. M. F., M. J. Iqbal, and S. Hameed, 2017: Impact of the North Atlantic Oscillation on winter climate of Germany. Tellus, 69A, 1406263, https://doi.org/10.1080/16000870.2017.1406263.
Ropelewski, C., and P. Jones, 1987: An extension of the Tahiti-Darwin Southern Oscillation index. Mon. Wea. Rev., 115, 2161–2165, https://doi.org/10.1175/1520-0493(1987)115<2161:AEOTTS>2.0.CO;2.
Saji, N., and T. Yamagata, 2003: Possible impacts of Indian Ocean Dipole mode events on global climate. Climate Res., 25, 151–169, https://doi.org/10.3354/cr025151.
Sanderson, B., R. Knutti, and P. Caldwell, 2015: A representative democracy to reduce interdependency in a multimodel ensemble. J. Climate, 28, 5171–5194, https://doi.org/10.1175/JCLI-D-14-00362.1.
Schwartz, S., 2012: Determination of Earth’s transient and equilibrium climate sensitivities from observations over the twentieth century: Strong dependence on assumed forcing. Surv. Geophys., 33, 745–777, https://doi.org/10.1007/s10712-012-9180-4.
Spanos, A., 2019: Probability Theory and Statistical Inference: Empirical Modeling with Observational Data. Cambridge University Press, 29 pp.
Steinman, B., M. Mann, and S. Miller, 2015: Atlantic and Pacific multidecadal oscillations and Northern Hemisphere temperatures. Science, 347, 988–991, https://doi.org/10.1126/science.1257856.
Steinschneider, S., R. McCrary, L. Mearns, and C. Brown, 2015: The effects of climate model similarity on probabilistic climate projections and the implications for local, risk-based adaptation planning. Geophys. Res. Lett., 42, 5014–5044, https://doi.org/10.1002/2015GL064529.
Stephenson, D., M. Collins, J. Rougier, and R. Chandler, 2012: Statistical problems in the probabilistic prediction of climate change. Environmetrics, 23, 364–372, https://doi.org/10.1002/env.2153.
Sun, J., K. Zhang, H. Wan, P. Ma, Q. Tang, and S. Zhang, 2019: Impact of nudging strategy on the climate representativeness and hindcast skill of constrained EAMv1 simulations. J. Adv. Model. Earth Syst., 11, 3911–3933, https://doi.org/10.1029/2019MS001831.
Swanson, K., G. Sugihara, and A. Tsonis, 2009: Long-term natural variability and 20th century climate change. Proc. Natl. Acad. Sci. USA, 106, 16 120–16123, https://doi.org/10.1073/pnas.0908699106.
Taylor, J., and R. Buizza, 2004: A comparison of temperature density forecasts from GARCH and atmospheric models. J. Forecasting, 23, 337–355, https://doi.org/10.1002/for.917.
Taylor, K., R. Stouffer, and G. Meehl, 2012: An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc., 93, 485–498, https://doi.org/10.1175/BAMS-D-11-00094.1.
Tebaldi, C., and R. Knutti, 2007: The use of the multi-model ensemble in probabilistic climate projections. Philos. Trans. Roy. Soc., 365, 2053–2075, https://doi.org/10.1098/rsta.2007.2076.
Tol, R., 1996: Autoregressive conditional heteroscedasticity in daily temperature measurements. Environmetrics, 7, 67–75, https://doi.org/10.1002/(SICI)1099-095X(199601)7:1<67::AID-ENV164>3.0.CO;2-D.
Tol, R., and A. de Vos, 1993: Greenhouse statistics-time series analysis. Theor. Appl. Climatol., 48, 63–74, https://doi.org/10.1007/BF00864914.
Tsonis, A. A., K. L. Swanson, G. Sugihara, and P. A. Tsonis, 2009: Climate change and the demise of Minoan civilization. Climate Past, 6, 525–530, https://doi.org/10.5194/cp-6-525-2010.
Tyson, P., and R. Preston-Whyte, 2000: The Weather and Climate of Southern Africa. Oxford University Press, 396 pp.
Vihma, T., and Coauthors, 2019: Effects of the tropospheric large-scale circulation on European winter temperatures during the period of amplified Arctic warming. Int. J. Climatol., 40, 509–529, https://doi.org/10.1002/joc.6225.
Wallace, J., C. Deser, B. Smoliak, and A. Phillips, 2015: Attribution of climate change in the presence of internal variability. Climate Change: Multidecadal and Beyond, C.-P. Chang et al., Eds., World Scientific, 1–29.
Wang, G., and W. Cai, 2013: Climate-change impact on the 20th-century relationship between the southern annular mode and global mean temperature. Sci. Rep., 3, 2039, https://doi.org/10.1038/srep02039.
Weigel, A., R. Knutti, M. Liniger, and C. Appenzeller, 2010: Risks of model weighting in multimodel climate projections. J. Climate, 23, 4175–4191, https://doi.org/10.1175/2010JCLI3594.1.