Assessment of Large-Scale Indices of Surface Temperature during the Historical Period in the CMIP6 Ensemble

A. Bodas-Salcedo, Met Office Hadley Centre, Exeter, United Kingdom
J. M. Gregory, Met Office Hadley Centre, Exeter, United Kingdom, and National Centre for Atmospheric Science, University of Reading, Reading, United Kingdom
D. M. H. Sexton, Met Office Hadley Centre, Exeter, United Kingdom
C. P. Morice, Met Office Hadley Centre, Exeter, United Kingdom

Abstract

We develop a statistical method to assess CMIP6 simulations of large-scale surface temperature change during the historical period (1850–2014), considering all time scales, allowing for the different unforced variability of each model and the observations, observational uncertainty, and variable ensemble size. The generality of this method, and the fact that it incorporates information about the unforced variability, makes it a useful model assessment tool. We apply this method to the historical simulations of the CMIP6 multimodel ensemble. We use three indices that measure different aspects of large-scale surface air temperature change: global mean, hemispheric gradient, and a recently developed index that captures the sea surface temperature (SST) pattern in the tropics (SST#; see Fueglistaler and Silvers 2021). We use the following observations: HadCRUT5 for the first two indices, and AMIPII and ERSSTv5 for SST#. In each case, we test the hypothesis that the model’s forced response is compatible with the observations, accounting for unforced variability in both models and observations as well as measurement uncertainty. This hypothesis is accepted more often (75% of the models) for the hemispheric gradient than for the global mean, for which half of the models fail the test. The tropical SST pattern is poorly simulated in all models. Given that the tropical SST pattern can strongly modulate the relationship between energy imbalance and global-mean surface temperature anomalies on annual to decadal time scales (short-term feedback parameter), we suggest this should be a focus area for future improvements due to its potential implications for the global-mean temperature evolution on decadal time scales.

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

Corresponding author: Alejandro Bodas-Salcedo, alejandro.bodas@metoffice.gov.uk

1. Introduction

The historical record of near-surface air temperature (SAT) is widely used as a performance metric for climate models (e.g., Braganza et al. 2003; Reichler and Kim 2008). The time series of annual-mean anomalies is a benchmark against which models are tested, and it has been used to assess the credibility of a model’s ability to provide information on future changes (e.g., Brunner et al. 2020). Recent research suggests that the later part of the historical period (1980 onward) contains information about the sensitivity of Earth’s climate to external forcing (Flynn and Mauritsen 2020; Dittus et al. 2020), although this relationship may not be as strong as suggested due to common model biases in the simulation of historical sea surface temperature (SST) patterns (Andrews and Webb 2018; Ceppi and Gregory 2017), the sensitivity to biomass aerosols (Fasullo et al. 2022), or a non-negligible contribution of internal variability on multidecadal trends (McKinnon and Deser 2018). The tropical SST patterns are strongly connected to regional precipitation anomalies, of relevance for the accurate simulation of drought-inducing teleconnections (e.g., Annamalai et al. 2013; Zinke et al. 2021). Also, the radiative forcing over the historical record is uncertain, mainly due to the role of aerosols (e.g., Smith et al. 2021), with important implications for the historical warming shown by models (e.g., Wang et al. 2021; Zhang et al. 2021). Potentially, all this information can be used to improve a model’s response to external forcing subject to the constraints of process observations. However, there is no common approach to incorporating the historical record into model development.

Several modeling centers have directly “calibrated” or “tuned” historical simulations (i.e., adjusted them to improve realism of climate change simulation) during the development of the models used for phase 6 of the Coupled Model Intercomparison Project (CMIP6; Eyring et al. 2016), while others did not use historical simulations during development. For example, the historical warming of the MPI-ESM1.2-LR model was tuned by reducing its climate sensitivity during its development (Mauritsen et al. 2019). Boucher et al. (2020) describe the developments and performance of the IPSL‐CM6A‐LR model. Although historical simulations were not used as part of the development, the r1i1p1f1 simulation was selected qualitatively among the first ∼12 available historical members, based on a few key observables of the historical period. During the development of the Energy Exascale Earth System Model, version 1 (E3SMv1), a historical simulation was performed with a near-final version of the model, but no action was taken to change the historical performance in the final version (Golaz et al. 2019).

The use of historical runs (or any coupled run with transient forcing) for tuning is not part of the Met Office Unified Model (UM) development protocol. The Hadley Centre models submitted to CMIP6 were not tuned to the historical record, although several model improvements were added to ensure that the total present-day radiative forcing was positive (Mulcahy et al. 2018). This approach was revised in the 2020 UM Users Workshop, where it was agreed that one of the key model errors was the simulation of the historical record. As a result, a Prioritised Evaluation Group (PEG) was created with the objective of improving the simulation of the historical global-mean surface temperature record. Also, in a recent review of the UM’s Global Configuration (GC) development protocol, it was agreed that a small ensemble of historical simulations will be run during the final stage of the development cycle, opening the option to implement model changes that target the performance of the simulation of the historical record before the final configuration is delivered to the users. In this paper we present the first step toward incorporating historical information into the UM’s development process. We develop a statistical method to test whether simulations of large-scale surface temperature change are realistic during the historical period (1850–2014). The method is applied to annual-mean time series of three surface temperature indices: global mean, hemispheric gradient, and a recently developed index that captures the SST pattern in the tropics (SST#; Fueglistaler and Silvers 2021). We test the historical simulations of the CMIP6 ensemble and post-CMIP6 versions of the HadGEM3 and UKESM models. We use the term “realistic” in a relative manner: a model that performs well against the tests described here can do so due to compensating errors (e.g., between forcings and feedbacks). Consequently, those models that we label as realistic in the present study could nonetheless be rejected once other metrics with additional observational evidence or process understanding are considered. This shortcoming is not specific to this methodology, and the method we propose here should be used alongside a wide range of diagnostics to provide a detailed assessment. The structure of the paper is as follows. Section 2 describes the observational and model data. The statistical methodology is detailed in section 3, and section 4 presents the results of the method applied to the CMIP6 historical ensemble. Finally, section 5 discusses the results and conclusions.

2. Model data and observations

We use near-surface air temperature (CMIP variable “tas”) data from the piControl and historical experiments of the CMIP6 archive, which are atmosphere–ocean coupled simulations. The piControl experiments are unforced simulations with forcing agents set at preindustrial levels (year 1850). After a spinup period, the CMIP6 protocol requests a minimum of 500 simulation years, but not all models fulfill this criterion. We explain how we deal with different lengths of the piControl time series in the next section.

The CMIP6 protocol (Eyring et al. 2016) recommended that the historical experiments be run with the current best estimates of the time-evolving datasets of forcing agents: atmospheric composition, solar irradiance, natural and anthropogenic aerosols, and land-use change, but not all institutions followed the protocol. The historical experiments branch from the piControl simulation and run from 1850 to 2014 (165 years). The CMIP6 protocol recommends running at least three historical simulations, branching from different points in the piControl simulations. We use 40 piControl simulations from the CMIP6 ensemble, plus simulations from GC4.0-LL and UKESM1.1-LL (Mulcahy et al. 2023), models developed after CMIP6.

We use three different observational datasets of surface temperature: the Met Office Hadley Centre/Climatic Research Unit global surface temperature dataset, version 5 (HadCRUT5.0.1.0; Morice et al. 2021), the Program for Climate Model Diagnosis and Intercomparison (PCMDI) SST reconstruction (Hurrell et al. 2008; Taylor et al. 2000), and the Extended Reconstructed Sea Surface Temperatures, version 5 (ERSSTv5; Huang et al. 2017a,b). The baseline period used for all historical datasets is 1880–1919.

HadCRUT5 provides temperature anomalies on a latitude–longitude rectangular grid. Two variants of the same dataset are provided: a noninfilled version, with data in grid boxes where measurements are available; and a more spatially complete version. For global and regional time series, the HadCRUT5 analysis error model contains two terms (Morice et al. 2021): the analysis error (εa) and the coverage error (εc). The analysis error combines the errors from the Gaussian process used in the statistical infilling and the instrumental errors. The analysis grids are not generally globally complete, particularly in the early observed record. Regions are omitted where there are insufficient data available to form reliable gridcell estimates. The coverage error represents the uncertainty in spatial averages arising from these unrepresented regions. The analysis error is represented by the 200 realizations of the historical record, whereas the coverage error is reported as a time series of standard deviations. We use the more spatially complete version, also termed the “HadCRUT5 analysis.” The HadCRUT5 analysis dataset uses a statistical method to extend temperature anomaly estimates into regions for which the underlying measurements are informative. This makes it more suitable for comparisons of large-scale regional average diagnostics against spatially complete model data, although variability in “infilled” regions will be lower than where observed measurement data are present (Jones 2016). We use the HadCRUT5 analysis as a reference dataset for two of the indices: global mean and hemispheric gradient. We use the global means calculated by averaging the hemispheric means, as recommended by Morice et al. (2021).

The SST# index is defined as the difference between the average of the warmest 30% SSTs (actual values, not anomalies) and the domain average. The domain used for this particular metric is the tropics, from 30°S to 30°N. This index represents the difference in SSTs between the convective regions and the tropical average, and it explains the anomalies in low cloud cover (and cloud radiative feedbacks) over the historical record due to changes in SST patterns (Fueglistaler and Silvers 2021). The index is calculated using monthly mean SSTs, and then annual averages are calculated. The same process is followed for both models and observations. Since this index cannot be calculated from local anomalies, a dataset that provides absolute temperature estimates is required. The PCMDI dataset provides monthly mean sea surface temperature and sea ice concentration data from 1870 to the present on a regular latitude–longitude grid. These data are designed to be used as boundary conditions for atmosphere-only simulations. They use the AMIPII midmonth calculation (Taylor et al. 2000), which ensures that the monthly mean of the time-interpolated data is identical to the input monthly mean. Following the convention in other studies, we refer to this dataset as PCMDI/AMIPII. SST# is subject to a large observational uncertainty (Fueglistaler and Silvers 2021), attributed to the different methodologies used to provide information where observations are not available. Given that the PCMDI/AMIPII dataset does not provide a comprehensive error characterization, we use ERSSTv5 to test the robustness of our results to the observational uncertainty in SST#. We have chosen the PCMDI/AMIPII and ERSSTv5 datasets because they fall at opposite ends of the spectrum of SST# anomalies provided by observational datasets, spanning the range of structural uncertainties in the observational reconstructions of SST#. There is evidence of differences between near-surface air temperature and surface temperature diagnostics (e.g., Richardson et al. 2016). The Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6; Gulev et al. 2021) quantifies the resulting uncertainty in global-mean long-term trends as at most 10% in either direction, with low confidence in the sign of any difference in long-term trends. Jones (2020) supports the use of global near-surface air temperature model diagnostics with blended datasets of observed temperature changes.
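To make the SST# definition concrete, the following Python sketch illustrates one way the index could be computed from a monthly field of absolute SSTs on a regular latitude–longitude grid. This is our illustration, not the authors’ code: the function and variable names are hypothetical, and the area weighting by the cosine of latitude and the weight-based selection of the warmest 30% are assumptions about implementation detail.

```python
import numpy as np

def sst_sharp(sst_monthly, lat, lon, warm_fraction=0.30):
    """Sketch of the SST# index: mean over the warmest 30% of the tropical
    ocean minus the tropical (30S-30N) mean, from monthly absolute SSTs.

    sst_monthly : array (time, lat, lon), absolute SST (NaN over land/ice)
    lat, lon    : 1D coordinate arrays in degrees
    """
    tropics = (lat >= -30.0) & (lat <= 30.0)
    sst = sst_monthly[:, tropics, :]
    # Area weights proportional to cos(latitude), broadcast over longitude.
    w2d = np.cos(np.deg2rad(lat[tropics]))[:, None] * np.ones(lon.size)

    index = np.empty(sst.shape[0])
    for t in range(sst.shape[0]):
        field = sst[t].ravel()
        weights = w2d.ravel()
        valid = np.isfinite(field)
        field, weights = field[valid], weights[valid]
        domain_mean = np.average(field, weights=weights)
        # Rank points from warmest to coldest and keep those holding the
        # first 30% of the total area weight.
        order = np.argsort(field)[::-1]
        cum_area = np.cumsum(weights[order]) / weights.sum()
        warm = order[cum_area <= warm_fraction]
        warm_mean = np.average(field[warm], weights=weights[warm])
        index[t] = warm_mean - domain_mean
    return index  # monthly SST#; annual means are formed afterward
```

Annual averages of the resulting monthly index would then be formed in the same way for models and observations, as described above.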

3. Methodology

Let HO(t) be the time series of the observed historical record anomalies of any given surface temperature index. We decompose it as HO(t) = S(t) + UO(t) + EO(t), where S(t) represents the forced signal, UO(t) is the unforced variability, and EO(t) is the total observational error. Similarly, for a given model we decompose any historical simulation of the same index as HM(t) = S(t) + DM(t) + UM(t), where DM(t) represents a discrepancy term or error in the forced response, and UM(t) is the model’s unforced variability.

If we hypothesize that the model’s forced response is realistic [i.e., DM(t) = 0], then HM(t) − HO(t) = UM(t) − UO(t) − EO(t). We can test this hypothesis by comparing HM(t) − HO(t) with the expected distribution of UM(t) − UO(t) − EO(t). In general, we have more than one realization of a model’s historical experiment, each of them with a different realization of the model unforced variability. Since we only have a single sample of the real world’s unforced variability, tests on individual ensemble members are not independent. We avoid this problem by formulating the test for ensemble means, noting that S(t) [and DM(t)] are the same for each ensemble member: HM¯(t) − HO(t) = UM¯(t) − UO(t) − EO(t). The overbars represent the ensemble mean. With this formulation, the observations are used only once for each model ensemble, with the contribution of their internal variability remaining constant with ensemble size (unlike the contribution of the model internal variability, which reduces with ensemble size).
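For clarity, the ensemble-mean formulation can be written out explicitly. The superscript i labeling individual ensemble members is our notation; the text leaves it implicit.

```latex
% Decomposition of member i of the historical ensemble
H_M^{(i)}(t) = S(t) + D_M(t) + U_M^{(i)}(t), \qquad i = 1, \dots, N_m

% Averaging over the N_m members, subtracting the observations, and
% setting D_M(t) = 0 under the null hypothesis:
\overline{H}_M(t) - H_O(t)
  = \frac{1}{N_m}\sum_{i=1}^{N_m} U_M^{(i)}(t) - U_O(t) - E_O(t)
  = \overline{U}_M(t) - U_O(t) - E_O(t)
```

The forced signal S(t) cancels exactly, the model noise is reduced by the averaging over Nm members, and the single realizations of UO(t) and EO(t) are unaffected by the ensemble size.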

The problem is now reduced to the characterization of the distribution of the right-hand side of the equation. Ideally, UO should be characterized from a long time series of the real system under no external forcing. Paleoclimatic proxy reconstructions are available only for restricted regions, and are therefore not representative of the large spatial scales of interest for this study; they also have larger errors. They have the additional complication that the external forcing is not zero during the paleoclimate record. Therefore, we instead assume that unforced simulations of the multimodel ensemble provide us with a reasonable estimate of the real world’s unforced variability, an approach that has been used in other studies (e.g., Gillett et al. 2002). Hence, we characterize UM¯ and UO using piControl simulations.

The subsections below describe the next steps in the methodology: calculation of the observational error term; estimation of the distribution of UM¯(t) − UO(t) − EO(t) using piControl simulations; definition of the metric and calculation of its control distribution; testing the historical ensembles; interpreting the tests.

a. Calculation of the observational error

For the HadCRUT5 observations, we combine analysis and coverage errors into a single error term (Eo) as follows. We add samples of a normally distributed random variable of zero mean and variance Var[εc(t)] to the residuals of the 200 realizations of the HadCRUT5 analysis. The total error inherits the autocorrelation characteristics of the analysis error, which is correlated in time. The error term Eo is then modeled by drawing random samples from this 200-member ensemble of realizations. The time dependence of Eo for the global mean is shown in Fig. 1. The black lines show the 95% confidence interval [comparable to the orange range in Fig. 2 of Morice et al. (2021)]. In general, the observational error decreases with time, apart from periods of international conflicts. The time dependence of Eo for the hemispheric difference is very similar to that of the global mean, but larger in magnitude.
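As an illustration of this construction, the following Python sketch assembles a total-error ensemble and draws samples from it. It is our sketch, not the authors’ code: the array names and shapes are assumptions, and it simply follows the recipe described above (coverage-error noise added to the residuals of the 200 analysis realizations).

```python
import numpy as np

rng = np.random.default_rng(0)

def build_total_error_ensemble(realizations, coverage_std):
    """Combine HadCRUT5 analysis and coverage errors into a total-error ensemble.

    realizations : array (200, nyears), index time series from the 200
                   HadCRUT5 analysis realizations
    coverage_std : array (nyears,), coverage-error standard deviations
    Returns an array (200, nyears) of total-error samples Eo(t).
    """
    # Residuals about the ensemble mean represent the analysis error and
    # retain its autocorrelation in time.
    residuals = realizations - realizations.mean(axis=0)
    # Coverage error: independent Gaussian noise with the reported std dev.
    coverage = rng.normal(0.0, coverage_std, size=realizations.shape)
    return residuals + coverage

def sample_total_error(error_ensemble):
    """Draw one random Eo(t) realization from the total-error ensemble."""
    return error_ensemble[rng.integers(error_ensemble.shape[0])]
```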

Fig. 1. Total observational error (Eo) of the global-mean metric. The gray lines show the residuals of individual realizations of the HadCRUT5 global-mean analysis, including a randomly generated contribution that accounts for the coverage error. The black lines are the bounds of the 95% confidence interval.

For the SST# index, we do not include an error term due to lack of error information in the observational datasets. However, we repeat the analysis with two different observational datasets to test the robustness of the results.

b. Construction of the unforced distribution of differences

Here we are concerned with the generation of random samples of UM¯(t) − UO(t) − EO(t) using piControl simulations. Although the piControl simulations are started after a spinup that is discarded, they are not in complete equilibrium (Eyring et al. 2016). For each model’s control time series, we construct a linearly detrended time series X(t) using the entire length of each control simulation. This increases the likelihood of adding noise to the detrended data (Sen Gupta et al. 2013; Jones et al. 2013), but some models show significant unforced variability on centennial time scales, which would be spuriously reduced by detrending shorter segments (Parsons et al. 2020).

We split the detrended control time series X(t) into nonoverlapping segments 165 years long, equal to the length of the CMIP6 historical simulations. The piControl simulations differ in length between models, so to give (nearly) equal weight to each model we use up to 3 segments of each piControl simulation. We also decide to retain models with shorter control time series. With these constraints, we use 41 piControl simulations, 32 of them with 3 segments, 5 with 2 segments, and 4 with only 1 segment. This gives 110 segments of piControl simulations of equal length. Then, we subtract the time average of each segment, so that the mean value of each segment is zero by construction. We call these detrended, 165-yr-long, zero-average piControl samples of the unforced variability UpiControl(t), and we use them to generate samples of UM¯(t; Nm) − UO(t) − EO(t), with Nm being the model’s historical simulation ensemble size. We sample both UM(t; Nm) and UO(t) from the ensemble of 110 UpiControl(t) segments. For instance, for a historical ensemble with 10 members, we randomly draw 11 UpiControl(t) segments, average 10 of them to calculate UM¯(t; Nm), and use the remaining one as UO(t). The UpiControl(t) samples are drawn from the pool of piControl segments of all models, not only of the model whose historical ensemble is being tested. For global-mean surface air temperature (GMSAT) and hemispheric difference, EO(t) is randomly sampled from the ensemble of 200 realizations of the total HadCRUT5 error as explained above, and the three time series are combined. We repeat this process 10 000 times for each historical ensemble. For constructing the distribution of unforced differences, the only information extracted from the historical ensemble is its size Nm.
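A minimal Python sketch of this sampling step is given below. It is our illustration rather than the authors’ code: the segment and error arrays are assumed to have been prepared as described above, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def unforced_difference_samples(segments, error_ensemble, n_members,
                                n_samples=10_000):
    """Generate samples of UM_mean(t) - UO(t) - EO(t) from piControl segments.

    segments       : array (n_segments, 165) of detrended, zero-mean
                     piControl segments, pooled across all models
    error_ensemble : array (n_real, 165) of total observational error
                     realizations, or None for indices without error info
    n_members      : historical ensemble size Nm of the model being tested
    """
    n_seg, n_years = segments.shape
    samples = np.empty((n_samples, n_years))
    for k in range(n_samples):
        # Draw Nm + 1 distinct segments: Nm for the model, 1 for the "observations".
        idx = rng.choice(n_seg, size=n_members + 1, replace=False)
        u_model = segments[idx[:-1]].mean(axis=0)
        u_obs = segments[idx[-1]]
        e_obs = 0.0
        if error_ensemble is not None:
            e_obs = error_ensemble[rng.integers(error_ensemble.shape[0])]
        samples[k] = u_model - u_obs - e_obs
    return samples
```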

Other approaches for estimating internal variability exist, and a recent study by Olonscheck and Notz (2017) provides a brief description of the two main avenues and their caveats. We have used a method that is based on piControl simulations, which may be unsuitable if the unforced variability is state dependent. However, Olonscheck and Notz (2017) show that the variability remains largely unchanged for historical simulations, even for those variables like sea ice area that show large changes in simulations of future warming. Therefore, we assume that the variability remains unchanged for the temperature indices used here and the amount of climate change in the historical period.

c. Definition of the metric: Number of exceedances

Our interest is to characterize the quality of a historical ensemble of simulations against observations. As a metric of quality, we compute HM¯(t) − HO(t) for each model (section 3d) and count the number of times that a running mean of the absolute value of this quantity exceeds a given threshold.

The samples of UM¯(t) − UO(t) − EO(t) generated in section 3b serve as the basis to construct unforced distributions of this metric.

We define E(T, y, Nm) as the number of exceedances above a threshold T (in K) of a filtered time series of the absolute values |UM¯(t) − UO(t) − EO(t)|. The filter applied is a running mean with a window length of y years. We define a two-dimensional rectangular grid in T and y, ranging between 0 and 0.3 K, and between 1 and 10 years, respectively. We then calculate 10 000 values of E for each combination (T, y). We use an absolute threshold in kelvins, but the method could easily be reformulated in terms of a threshold defined in units of standard deviations of the unforced variability.
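The counting itself reduces to a few lines; the sketch below is our illustration under the description above (the treatment of the window end points is an implementation detail not specified in the text).

```python
import numpy as np

def count_exceedances(diff, threshold, window):
    """Number of exceedances E(T, y) for one difference time series.

    diff      : array (165,), one sample of UM_mean(t) - UO(t) - EO(t),
                or of HM_mean(t) - HO(t) when testing a historical ensemble
    threshold : exceedance threshold T in K
    window    : running-mean window length y in years
    """
    # y-yr running mean of the absolute differences.
    smoothed = np.convolve(np.abs(diff), np.ones(window) / window, mode="valid")
    return int(np.sum(smoothed > threshold))
```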

Figure 2 presents an example of this process for the GMSAT, leading to the calculation of one sample of E(0.1, 10, 5). The blue line shows one sample of UM¯(t) − UO(t) − EO(t). The red line is the smoothed time series of the absolute value of the blue time series, using a y = 10-yr running mean. The green line represents the temperature threshold T = 0.1 K. The value of E(0.1, 10, 5) is the number of points from the red line that lie above the green line.

Fig. 2. Graphical example of the calculation of the number of exceedances for a given pair of segments of the piControl simulations. This example is for GMSAT, but the method is the same for all indices. The blue line shows the difference between the two piControl segments that provides a sample of UM − UO. The red line is the absolute value of the 10-yr running mean of the blue line. The green line represents the exceedance threshold, 0.1 K in this example. The number of exceedances is the number of red points above the green line.

We construct a second metric following the same steps, but using the variance-scaled samples (σM/σ)[UM¯(t) − UO(t)] − EO(t), where σM is the model’s standard deviation of the linearly detrended piControl anomalies, and σ is the multimodel mean standard deviation of all the linearly detrended piControl anomalies. This provides a variance-scaled set of samples of control distributions of exceedances that accounts for differences in the variance of the unforced variability across different models. We label this second metric as Es(T, y, Nm).
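Following the reconstruction of the scaled expression above, the scaling amounts to a one-line change in the sampling step (again a hypothetical helper, not the authors’ code):

```python
def scaled_unforced_difference(u_model_mean, u_obs, e_obs, sigma_model, sigma_mm):
    """Variance-scaled sample (sigma_M / sigma) * [UM_mean(t) - UO(t)] - EO(t).

    sigma_model : std dev of the tested model's detrended piControl anomalies
    sigma_mm    : multimodel mean std dev of the detrended piControl anomalies
    """
    return (sigma_model / sigma_mm) * (u_model_mean - u_obs) - e_obs
```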

From these sets of samples of E(T, y, Nm) and Es(T, y, Nm), we construct empirical quantile distribution functions QZ(p; T, y, Nm), which give the number of exceedances for a given cumulative probability p. Z is a generic discrete random variable name that refers to either E or Es. For simplicity, from now on we omit the dependency on the ensemble size Nm.

In summary, for each historical ensemble, we have calculated two (one with variance scaling and one without) empirical quantile functions in each point of the (T, y) grid. Figure 3 shows examples of QE for a historical ensemble of 3 members. For a given T and y, the probability is p that the number of exceedances (occurring during a 165-yr historical integration) will be less than QZ(p; T, y). There is zero chance that the number of exceedances will be less than zero, a small chance that it will be less than a small number, and we are certain that it will be less than a sufficiently large number (at most 165). Thus, QZ increases with p (Figs. 3a,b). For any given p, the expected number of exceedances QZ is smaller for a longer meaning period y (Fig. 3a) or a higher threshold T (Fig. 3b).

Fig. 3. Empirical quantile distribution functions QZ(p; T, y, Nm) for an ensemble size Nm = 3. Color lines show examples for (a) different lengths of the averaging window (in years, as shown in the legend) for an exceedance threshold of 0.11 K and (b) different exceedance thresholds (in K, as shown in the legend) for an averaging window of 6 years.

d. Testing ensembles of historical simulations

We test each historical ensemble by comparing the number of exceedances of the difference between the ensemble mean and the observations against the expected number of exceedances given by the control distribution. First, we calculate HM¯(t) − HO(t), which we use as input to calculate the number of exceedances for each point in the (T, y) grid, Eh(T, y), where the subscript h denotes that this is calculated from a historical ensemble, and HO(t) is the HadCRUT5 analysis ensemble mean. The linear drift of the piControl is subtracted from the historical time series. We then perform two one-tailed tests, each with a significance level α. This is done by comparing Eh(T, y) against the empirical quantile function QZ(p; T, y), separately for Z = E and Z = Es. In each case, when either Eh(T, y) > QZ(1 − α; T, y) or Eh(T, y) < QZ(α; T, y), the historical ensemble is flagged as incompatible at that point of the (T, y) grid. That is, we reject the null hypothesis that the difference between the historical simulation and observations is consistent with unforced variability if the number of times Z that the difference between them exceeds the threshold T in y-year means is either much larger than expected (upper-tail test) or much smaller than expected (lower-tail test).
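The comparison at a single (T, y) grid point can be sketched as follows (our illustration; the quantiles are taken from the 10 000 control samples described in section 3c):

```python
import numpy as np

def test_point(e_hist, control_counts, alpha=0.05):
    """Two one-tailed tests of the historical exceedance count at one (T, y) point.

    e_hist         : exceedance count of HM_mean(t) - HO(t) at this (T, y)
    control_counts : array of exceedance counts from the unforced samples
    Returns (upper_fail, lower_fail); either flags the ensemble as incompatible.
    """
    q_low = np.quantile(control_counts, alpha)          # QZ(alpha; T, y)
    q_high = np.quantile(control_counts, 1.0 - alpha)   # QZ(1 - alpha; T, y)
    upper_fail = e_hist > q_high   # many more exceedances than expected
    lower_fail = e_hist < q_low    # many fewer exceedances than expected
    return upper_fail, lower_fail
```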

Figure 4 shows an example for the upper-tail test applied to the entire (T, y) grid, using a significance level α = 0.05. For illustrative purposes, it is helpful to choose a model like EC-Earth3-Veg with large multidecadal unforced variability (Parsons et al. 2020). The filled contours in Figs. 4a and 4c show QZ(p = 0.95; T, y) for Z = E and Z = Es, respectively. The shape of QZ is very similar for all models and ensemble sizes. As shown also in Fig. 3, QZ gets smaller as T gets larger for a given y (less likely to exceed a higher threshold), and smaller as y gets larger for a given T (less likely for a longer time mean to exceed a threshold), although the dependency on y is much weaker.

Fig. 4. Tests of the historical ensemble of EC-Earth3-Veg: tests (a) without and (c) with variance scaling. The filled contours show QZ(p = 0.95; T, y) for (a) Z = E and (c) Z = Es. These surfaces show the expected number of exceedances normalized by 165 (maximum number of exceedances) for the 95th percentile (p = 0.95, as noted in the bottom-left corner) of the piControl distributions in each point of the (T, y) grid. Observational uncertainty is included when available. The dots show the points in the (T, y) grid where the historical ensemble fails the test, i.e., Eh(T, y) > QZ(p = 0.95; T, y). (b) The last 500 years of the piControl simulation of the model tested are shown. (d) The annual-mean historical anomalies of the temperature index being tested: model’s ensemble mean (black) and range (gray) and the observed anomalies (green). The historical anomalies in (d) are calculated with respect to the 1880–1919 time average. The legend in (d) shows the number of historical realizations used in the calculation of the ensemble mean.

The dotted regions in the (T, y) grid mark where the null hypothesis is rejected (Eh > QZ). In this example, the test without variance scaling (Fig. 4a) shows many rejections, whereas the variance-scaled test (Fig. 4c) shows none. This contrast implies that the unforced variance of EC-Earth3-Veg is larger than the multimodel mean variance. The large variance increases the number of exceedances in the test without variance scaling, whereas variance scaling raises the control surface QZ(p = 0.95; T, y), making it easier for the model to pass the test. This scaling is trying to penalize those models that pass the nonscaled test due to a very small unforced variability compared to the multimodel mean variance, which we assume to be the best estimate of the unforced variability.

How much of the (T, y) space must fail the statistical test for the model as a whole to be deemed “incompatible”? We have divided the (T, y) grid into 29 × 10 points, so we would expect a good model to fail at 29 × 10 × 0.05 ≈ 15 points in the (T, y) grid just by chance if there were no correlation in the number of exceedances between (T, y) neighbors. Because the time scale and the threshold are correlated, if incompatibility occurs it is likely to cover patches of adjacent points in the (T, y) grid. Given that our main aim is to apply this method to intercompare models, we do not define a single, strict threshold for labeling a model as incompatible with the observations. Instead, we use the following guidance: models with fewer than 10 failures (dots) pass the test; models that fail between 10 and 20 times are considered marginal; models with more than 20 failures are labeled as incompatible.

The lower-tail test [Eh(T, y) < QZ(p = 0.05; T, y)] can be presented in a similar way, but only one of the models tested fails this test (FGOALS-g3, and only marginally). A model fails this test if its historical simulation deviates less than expected from reality, which can happen only if it has both a realistic forced response and unrealistically small unforced variability. It could be that the lower-tail test rarely fails because models in general do not have a realistic forced response. For the remainder of the paper, we discuss the results of the upper-tail test only.

We have tested the sensitivity of the results to the order of the polynomial used for the detrending of the piControl time series. The results are largely insensitive to the use of quadratic instead of linear detrending, so we conclude that our method is robust with respect to the detrending method. If this test is applied to metrics that require nonlinear detrending we would recommend the use of more flexible methods with better properties (e.g., splines).

4. Results and discussion

In this section we present results for three temperature indices: global mean, hemispheric difference, and SST#. These three metrics capture important complementary information about key aspects of temperature change over the historical record. The global mean has been widely used as the most fundamental metric of climate change. The hemispheric difference captures the influence of anthropogenic aerosols during the historical period, as emissions are dominated by sources in the Northern Hemisphere, and it is reasonably independent of the global mean (Braganza et al. 2003). The changes in tropical SST pattern control the sign and strength of low cloud feedbacks in response to CO2 forcing (e.g., Miller 1997; Gregory and Andrews 2016), making it an important metric of the historical record.

a. Global mean

Figure 5 shows the tests without variance scaling. Out of the 40 models analyzed, 20 can be labeled as incompatible with the observed record, according to this test. These are models that show large, dotted areas. The other 20 models do not fail the test at all or fail it only in a few instances. Models tend to fail the test for large exceedance thresholds T, with little dependence on the length of the averaging window y; that is, they tend to fail along entire “columns” in the contour plot.

Fig. 5. Multimodel summary of the test without variance scaling applied to the global-mean surface air temperature index. The number of exceedances is normalized by 165, the maximum number of exceedances given by the length of the historical record. The test uses the 95th percentile (p = 0.95, as noted in the bottom-left corner) of the unforced distributions.

When the variance-scaled test is applied (Fig. 6), 22 models are labeled as incompatible with the observed record, and 18 models pass the test. No models are in the marginal category. The variance-scaled test rejects 5 additional models, and labels as compatible 3 models that were rejected by the test without variance scaling. This is because these models have a piControl variance that is very different to the multimodel mean variance.

Fig. 6. Multimodel summary of the test with variance scaling applied to the annual global-mean surface air temperature index. The number of exceedances is normalized by 165, the maximum number of exceedances given by the length of the historical record. The test uses the 95th percentile (p = 0.95, as noted in the bottom-left corner) of the unforced distributions.

We have presented an example of a model with large unforced variability in Fig. 4. Figure 7 shows an example of a model with small unforced variability: MRI-ESM2-0. The control surface of the number of exceedances is lowered by the variance scaling, making it easier for the model to fail the test. Since we are not making any assumption about the quality of the piControl simulations of individual models, the variance scaling method is an attempt to enable a fair comparison between models with different unforced variability.

Fig. 7. As in Fig. 4, but for model MRI-ESM2-0.

These two examples show how the characteristics of each model’s unforced variability are incorporated into the test. This is particularly helpful when the ensemble size of historical simulations is small, which makes it difficult to assess the impact of the unforced variability by visual inspection. It must be emphasized that we treat all piControl simulations as equally plausible, but the method could be refined by bringing in external information to better characterize the unforced variability of the real system. We expand on this below when we discuss the caveats of the methodology.

b. Hemispheric gradient

Figure 8 shows the tests without variance scaling for the hemispheric gradient index. Out of the 40 models analyzed, 8 are labeled as incompatible with the observed record, 1 is marginal, and 31 pass the test. If variance scaling is used (Fig. 9), the results are very similar, with 7 models rejected, 3 marginal, and 29 passing the test. As with the global mean, failures tend to happen along “columns,” that is, for all averaging window lengths. It is interesting to note that, contrary to the global mean, the hemispheric gradient shows more failures for small exceedance thresholds.

Fig. 8. As in Fig. 5, but for the hemispheric gradient surface air temperature index.

Fig. 9. As in Fig. 6, but for the hemispheric gradient surface air temperature index.

In CESM2 there is a strong sensitivity of the hemispheric gradient to the variability in biomass burning emissions from 40° to 70°N, which leads to spurious warming in the late historical period (Fasullo et al. 2022). However, this model passes the global and hemispheric tests, which may suggest the presence of compensating biases. This highlights the importance of having a large battery of diagnostics capable of assessing model performance from different angles.

c. SST#

Figures 10 and 11 show the multimodel ensemble results for SST#, without and with scaling of the unforced variance, respectively. The test without variance scaling rejects all CMIP6 models. Only one model is not rejected, namely GISS-E2-1-G, when variance scaling is used. The GISS models are examples of models with large unforced variability (Fig. 12). Unlike in previous examples with large unforced variability on long time scales (Fig. 4), the unforced variability of the GISS models is dominated by high-frequency (annual) variability. Given that the observational record does not show such a large high-frequency variability, we conclude that the test without variance scaling is probably a better assessment of the performance of the GISS models. This conclusion is also supported by Orbe et al. (2020) who show that GISS-E2-1-G is an outlier in the simulation of ENSO.

Fig. 10. As in Fig. 5, but for the SST# index. The observational SST# index is calculated using the PCMDI/AMIPII dataset.

Fig. 11. As in Fig. 6, but for the SST# index. The observational SST# index is calculated using the PCMDI/AMIPII dataset.

Fig. 12. As in Fig. 4, but for the SST# index of model GISS-E2-1-G. The green line in (d) shows the PCMDI/AMIPII observational estimate.

SST# is subject to a large observational uncertainty (Fueglistaler and Silvers 2021). The observations show very good agreement during the satellite era (1979 onward), when the spatial coverage is very dense, but they show large discrepancies before satellite data were available. The differences are attributed to the different methodologies used to provide information where observations are not available. Given that the PCMDI/AMIPII dataset does not provide a comprehensive error characterization, we have repeated the tests using the ERSSTv5 dataset to test the robustness of our conclusions. We have chosen the PCMDI/AMIPII and ERSSTv5 datasets because they fall at opposite ends of the spectrum of SST# anomalies provided by observational datasets, giving us information about structural uncertainties in the observational reconstructions of SST#. The results with ERSSTv5 (not shown) are similar to the comparisons against PCMDI/AMIPII: all the CMIP6 models are rejected by both tests, with and without variance scaling. This confirms that the results are robust with respect to observational uncertainty in SST#.

The fact that the entire CMIP6 ensemble performs poorly in the SST# index is consistent with previous studies showing that models in general do not reproduce the Pacific SST trends of recent decades (Seager et al. 2019; Gregory et al. 2020; Wills et al. 2022), and it has potential implications beyond the models’ performance over the historical period.

Unlike for the two other indices, there is no consensus on whether SST# should contain a forced signal or whether it is part of the unforced variability of the climate system. Some recent studies suggest that the tropical Pacific SST patterns observed during recent decades could arise from internal climate variability (e.g., Olonscheck et al. 2020; Watanabe et al. 2021). Other studies suggest that the SST patterns are consistent with a forced response to greenhouse forcing (Seager et al. 2019) that can be explained with simple models (Clement et al. 1996), or with a potential role for volcanic or anthropogenic aerosols in setting the recent patterns (Gregory et al. 2020; Heede and Fedorov 2021; Dittus et al. 2021). If the observed evolution of SST# is not forced, no model ensemble mean can be expected to agree with the observations; in that case, if a model fails the test, it means that its simulation of SST# variability has the wrong magnitude. On the other hand, if SST# is forced, a rejection of the test means that the model does not replicate the forced response, and if a large number of models fail the test it could imply a common bias in the forced response. In either case, a rejection of the test indicates that some aspect of the model’s behavior is deficient. Additional process-level analysis and physical hypothesis testing are required to improve our understanding of the causes behind the model errors.

d. Caveats and interpretation of the tests

The results above show how the methodology presented here can be used to assess historical simulations during the model development process. We have applied it to surface temperature indices, but it can be applied to any variable for which observational estimates over the historical period exist. However, the methodology presents some interpretation challenges and caveats. How do we interpret a rejection of the null hypothesis that the model’s forced response is realistic? Can we definitively conclude that there is a problem with the model’s forced signal? There is a chance that the null hypothesis is wrongly rejected although true; that is a type I error, whose probability is the chosen significance level. If we reject the null hypothesis, we must have an alternative hypothesis. Potential alternatives are as follows: there is a problem with the model’s forced signal; our model-based unforced variability is biased; the forcing is wrong. We do not have a statistical means to estimate the probability of these systematic errors.

It is also worth mentioning that agreement between the observations and simulations might be due to compensating errors. Potential problems that could contribute to compensating errors concern the following: aerosol radiative forcing and aerosol–cloud interactions (e.g., Paulot et al. 2018; Rieger et al. 2020; Wang et al. 2021; Fasullo et al. 2022); tropical SST patterns and their role in global radiative feedbacks (Ceppi and Gregory 2017; Andrews and Webb 2018). The unforced distributions used to define the exceedance quantile functions are constructed from piControl simulations. This assumes that the multimodel ensemble provides us with a good representation of the unforced variability, which is not necessarily true. As we have shown above when discussing the variance-scaled results, there exist large discrepancies in the representation of unforced variability between models (Parsons et al. 2020), which raises questions about the ability of at least some models to provide a good estimate of unforced variability. If the unforced variability estimated from the multimodel ensemble is biased, then our method will be biased. One avenue that could be explored for improving this would be to incorporate information from proxy temperature reconstructions into a correction of the unforced variability. However, the use of proxy reconstructions is not free from problems. The reconstructions are for restricted regions where there are proxies (e.g., PAGES 2k Consortium 2013), and much of their variability is forced by volcanoes and solar variability (PAGES 2k Consortium 2019). In any case, a failure of this type would imply that the models’ piControl simulations are wrong (rather than the forced signal necessarily), so the test would still be highlighting a problem.

Our test with scaled variance is an initial attempt to identify outliers, but more sophisticated methods could be used. Perhaps a better estimate of the unforced variability could be achieved by restricting the set of models used to form the distributions of internal variability. This selection could be based on how models represent observational estimates of the spectra of some modes of variability (Fasullo et al. 2020). For SST#, basing this selection on some metric of ENSO could be particularly useful (Planton et al. 2021). Screening out models would reduce the number of piControl simulations, so this would have an impact on the robustness of the unforced distributions.

A second caveat is the differing sizes of the historical ensembles. Out of the 40 models analyzed here, only 4 have historical ensembles with more than 10 members, and 31 models have 5 or fewer historical simulations. Large ensembles will provide more robust tests. A model with a small ensemble will provide a less precise estimate of the ensemble mean, making the result of the test more likely to be different from the result that would be obtained with a large ensemble. This is a general problem with statistical hypothesis testing, and it should be incorporated into the subjective interpretation of the tests. We propose some guidance based on the dependence of the variance of the control distribution on the size of the ensemble. As explained above, the control distribution is constructed from samples of UM¯(t; Nm) − UO(t) − EO(t). The observational error is typically small compared to the unforced variability, so we can approximate the dependence of the variance as (1 + 1/Nm) × σ, where σ is the variance of UM(t) and UO(t). As Nm becomes larger, the total variance decreases from 2 (in units of σ) to its asymptotic value of 1, with the rate of change being larger for small Nm. For instance, an ensemble of 10 members will reduce the variance to within 10% of its asymptotic value, which will significantly increase the robustness of the test.
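Written out (keeping the text’s convention that σ denotes the variance of a single unforced realization, and neglecting the observational error), the variance of the control distribution is approximately

```latex
\operatorname{Var}\!\left[\overline{U}_M(t) - U_O(t)\right]
  \approx \frac{\sigma}{N_m} + \sigma
  = \left(1 + \frac{1}{N_m}\right)\sigma ,
```

so Nm = 1 gives 2σ, Nm = 10 gives 1.1σ (within 10% of the asymptotic value σ), and the limit of very large Nm gives σ.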

We do not account for the uncertainty in radiative forcing, which could lead to overtuning if the only objective is to match the warming over the historical period (e.g., Hourdin et al. 2017). However, we are not advocating making development choices based only on the approach presented here. A wide range of other metrics, including process-based metrics, needs to be considered. The use of a much wider basket of metrics should reduce the risk of overtuning.

A final caveat is that the variance scaling cannot account for differences in models’ piControl variability on different time scales, so while the overall variability of two models can be scaled to be similar, the interannual/multidecadal variability could still be very different. We have subjectively accounted for this in the discussion of the SST# results for GISS-E2-1-G, whose variability is dominated by large interannual variability, which can be confidently assessed with observations of the historical period. However, for variability at much longer time scales, the observational record provides very limited information. A possible approach to explore in the future is to account for this by applying different variance scaling factors for each value of the averaging window.

5. Conclusions

The historical record of surface temperature is an important metric that climate models should be able to reproduce. However, it is not consistently used by modeling centers during model development for two main and quite distinct reasons: first, coupled simulations are expensive to run, especially because the historical simulation must be preceded by a spinup simulation long enough to eliminate drift; second, the observed historical record of surface temperature is reserved as an out-of-sample validation. It is generally argued that the warming during the historical record and emergent properties like equilibrium climate sensitivity should be used as an a posteriori evaluation and not as a target for model development, although there is not complete consensus among the modeling community on this topic (Hourdin et al. 2017). Bock et al. (2020) highlight the risk of tuning models to reproduce a set of metrics while ignoring deficiencies elsewhere. However, this risk is not specific to metrics based on historical warming. Within the context of emergent constraints, Eyring et al. (2019) advocate the use of variability metrics or trends during model development.

We develop a statistical method to test whether simulations of large-scale surface temperature change are consistent with the observed warming of the historical period (1850–2014). The method uses information on a range of time scales. It incorporates information about unforced variability, and it is designed to test an entire ensemble of simulations of any size. The method is applied to annual-mean time series of three surface temperature indices: global mean, hemispheric gradient, and a recently developed index that captures the sea surface temperature (SST) pattern in the tropics (SST#; Fueglistaler and Silvers 2021). We test the historical simulations of the CMIP6 ensemble and post-CMIP6 versions of the HadGEM3 and UKESM models.

Around half the models fail the test for the global-mean time series, approximately a fifth of the models fail when the hemispheric temperature gradient is analyzed, and all models fail the SST# test. We note the importance of the characteristics of the models’ unforced variability (Parsons et al. 2020). Assessment of the quality of the historical simulations by visual comparison of the time series of a few ensemble members against the observations can be misleading, being reliable only for models with a large number of historical realizations. The method presented here complements other statistical approaches that have previously compared historical model simulations to observations (e.g., Sanderson et al. 2015; Brunner et al. 2020; Suarez-Gutierrez et al. 2021). Given that most modeling centers only run a small number of historical simulations, a method like the one presented here that accounts for the unforced variability is desirable, especially if the aim is to use it during the model development process, where large ensembles are not affordable.

We show that the method presented here can be used as a tool to assess historical simulations during the development process. The method is easy to apply and summarizes a large amount of information in two plots, with and without variance scaling. It accounts for the unforced variability of the model tested, and it can be applied to an ensemble of historical simulations of arbitrary size. We also plan to make this methodology available to the community by implementing it in the Earth System Model Evaluation Tool (ESMValTool; Eyring et al. 2020).

There are several avenues that could be explored to develop this method further. One potential improvement could be to incorporate information from proxy reconstructions to improve the estimate of the unforced variability, currently based on control model simulations. However, this may prove difficult given that many proxies do not resolve annual variability, and because of the nonstationarity of the magnitude of internal variability. Perhaps a better estimate of the unforced variability could be achieved by restricting the model set used to form the distributions of internal variability based on how models represent observational estimates of annual to decadal modes of variability (Fasullo et al. 2020).

A second area for further development could be to apply a scaling factor, as is done in optimal fingerprinting (e.g., Allen and Tett 1999). Some of the models that are rejected by our current methodology could pass the test if they were appropriately scaled. The interpretation of the test results with the scaled time series is not straightforward, but it may be useful to know that a model that is rejected could be made realistic by a scaling factor. Despite these caveats, the generality of the method presented here and the fact that it incorporates information about the unforced variability make it a useful tool for the assessment of historical simulations.

Acknowledgments.

This work was supported by the Met Office Hadley Centre Climate Programme funded by BEIS. We thank an anonymous reviewer, J. T. Fasullo and R. C. J. Wills for their constructive comments that helped improve the original manuscript. We thank Gareth Jones for his contribution to the methodology and useful comments on an early draft version of the manuscript. We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.

Data availability statement.

HadCRUT.5.0.1.0 data were obtained from http://www.metoffice.gov.uk/hadobs/hadcrut5 on 15 February 2021 and are British Crown Copyright, Met Office 2021, provided under an Open Government License, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/. PCMDI AMIP SSTs were obtained from the ESGF archive, variable tosbcs from input4MIPs, version v20220201. NOAA_ERSST_V5 data provided by NOAA/OAR/ESRL PSL, Boulder, Colorado, were accessed at https://psl.noaa.gov/data/gridded/data.noaa.ersst.v5.html (Huang et al. 2017a,b).

REFERENCES

  • Allen, M. R., and S. F. B. Tett, 1999: Checking for model consistency in optimal fingerprinting. Climate Dyn., 15, 419–434, https://doi.org/10.1007/s003820050291.
  • Andrews, T., and M. J. Webb, 2018: The dependence of global cloud and lapse rate feedbacks on the spatial structure of tropical Pacific warming. J. Climate, 31, 641–654, https://doi.org/10.1175/JCLI-D-17-0087.1.
  • Annamalai, H., J. Hafner, K. P. Sooraj, and P. Pillai, 2013: Global warming shifts the monsoon circulation, drying South Asia. J. Climate, 26, 2701–2718, https://doi.org/10.1175/JCLI-D-12-00208.1.
  • Bock, L., and Coauthors, 2020: Quantifying progress across different CMIP phases with the ESMValTool. J. Geophys. Res. Atmos., 125, e2019JD032321, https://doi.org/10.1029/2019JD032321.
  • Boucher, O., and Coauthors, 2020: Presentation and evaluation of the IPSL-CM6A-LR climate model. J. Adv. Model. Earth Syst., 12, e2019MS002010, https://doi.org/10.1029/2019MS002010.
  • Braganza, K., D. Karoly, A. Hirst, M. Mann, P. Stott, R. Stouffer, and S. Tett, 2003: Simple indices of global climate variability and change: Part I—Variability and correlation structure. Climate Dyn., 20, 491–502, https://doi.org/10.1007/s00382-002-0286-0.
  • Brunner, L., A. G. Pendergrass, F. Lehner, A. L. Merrifield, R. Lorenz, and R. Knutti, 2020: Reduced global warming from CMIP6 projections when weighting models by performance and independence. Earth Syst. Dyn., 11, 995–1012, https://doi.org/10.5194/esd-11-995-2020.
  • Ceppi, P., and J. M. Gregory, 2017: Relationship of tropospheric stability to climate sensitivity and Earth’s observed radiation budget. Proc. Natl. Acad. Sci. USA, 114, 13 126–13 131, https://doi.org/10.1073/pnas.1714308114.
  • Clement, A. C., R. Seager, M. A. Cane, and S. E. Zebiak, 1996: An ocean dynamical thermostat. J. Climate, 9, 2190–2196, https://doi.org/10.1175/1520-0442(1996)009<2190:AODT>2.0.CO;2.
  • Dittus, A. J., E. Hawkins, L. J. Wilcox, R. T. Sutton, C. J. Smith, M. B. Andrews, and P. M. Forster, 2020: Sensitivity of historical climate simulations to uncertain aerosol forcing. Geophys. Res. Lett., 47, e2019GL085806, https://doi.org/10.1029/2019GL085806.
  • Dittus, A. J., E. Hawkins, J. I. Robson, D. M. Smith, and L. J. Wilcox, 2021: Drivers of recent North Pacific decadal variability: The role of aerosol forcing. Earth’s Future, 9, e2021EF002249, https://doi.org/10.1029/2021EF002249.
  • Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
  • Eyring, V., and Coauthors, 2019: Taking climate model evaluation to the next level. Nat. Climate Change, 9, 102–110, https://doi.org/10.1038/s41558-018-0355-y.
  • Eyring, V., and Coauthors, 2020: Earth System Model Evaluation Tool (ESMValTool) v2.0—An extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of Earth system models in CMIP. Geosci. Model Dev., 13, 3383–3438, https://doi.org/10.5194/gmd-13-3383-2020.
  • Fasullo, J. T., A. S. Phillips, and C. Deser, 2020: Evaluation of leading modes of climate variability in the CMIP archives. J. Climate, 33, 5527–5545, https://doi.org/10.1175/JCLI-D-19-1024.1.
  • Fasullo, J. T., J.-F. Lamarque, C. Hannay, N. Rosenbloom, S. Tilmes, P. DeRepentigny, A. Jahn, and C. Deser, 2022: Spurious late historical-era warming in CESM2 driven by prescribed biomass burning emissions. Geophys. Res. Lett., 49, e2021GL097420, https://doi.org/10.1029/2021GL097420.
  • Flynn, C. M., and T. Mauritsen, 2020: On the climate sensitivity and historical warming evolution in recent coupled model ensembles. Atmos. Chem. Phys., 20, 7829–7842, https://doi.org/10.5194/acp-20-7829-2020.
  • Fueglistaler, S., and L. G. Silvers, 2021: The peculiar trajectory of global warming. J. Geophys. Res. Atmos., 126, e2020JD033629, https://doi.org/10.1029/2020JD033629.
  • Gillett, N. P., F. W. Zwiers, A. J. Weaver, G. C. Hegerl, M. R. Allen, and P. A. Stott, 2002: Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett., 29, 1970, https://doi.org/10.1029/2002GL015836.
  • Golaz, J.-C., and Coauthors, 2019: The DOE E3SM coupled model version 1: Overview and evaluation at standard resolution. J. Adv. Model. Earth Syst., 11, 2089–2129, https://doi.org/10.1029/2018MS001603.
  • Gregory, J. M., and T. Andrews, 2016: Variation in climate sensitivity and feedback parameters during the historical period. Geophys. Res. Lett., 43, 3911–3920, https://doi.org/10.1002/2016GL068406.
  • Gregory, J. M., T. Andrews, P. Ceppi, T. Mauritsen, and M. J. Webb, 2020: How accurately can the climate sensitivity to CO2 be estimated from historical climate change? Climate Dyn., 54, 129–157, https://doi.org/10.1007/s00382-019-04991-y.
  • Gulev, S. K., and Coauthors, 2021: Changing state of the climate system. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 287–422, https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_Chapter02.pdf.
  • Heede, U. K., and A. V. Fedorov, 2021: Eastern equatorial Pacific warming delayed by aerosols and thermostat response to CO2 increase. Nat. Climate Change, 11, 696–703, https://doi.org/10.1038/s41558-021-01101-x.
  • Hourdin, F., and Coauthors, 2017: The art and science of climate model tuning. Bull. Amer. Meteor. Soc., 98, 589–602, https://doi.org/10.1175/BAMS-D-15-00135.1.
  • Huang, B., and Coauthors, 2017a: Extended Reconstructed Sea Surface Temperature, version 5 (ERSSTv5): Upgrades, validations, and intercomparisons. J. Climate, 30, 8179–8205, https://doi.org/10.1175/JCLI-D-16-0836.1.
  • Huang, B., and Coauthors, 2017b: NOAA Extended Reconstructed Sea Surface Temperature (ERSST), version 5. NOAA National Centers for Environmental Information, accessed 1 September 2021, https://doi.org/10.7289/V5T72FNM.
  • Hurrell, J. W., J. J. Hack, D. Shea, J. M. Caron, and J. Rosinski, 2008: A new sea surface temperature and sea ice boundary dataset for the Community Atmosphere Model. J. Climate, 21, 5145–5153, https://doi.org/10.1175/2008JCLI2292.1.
  • Jones, G. S., 2020: “Apples and oranges”: On comparing simulated historic near-surface temperature changes with observations. Quart. J. Roy. Meteor. Soc., 146, 3747–3771, https://doi.org/10.1002/qj.3871.
  • Jones, G. S., P. A. Stott, and N. Christidis, 2013: Attribution of observed historical near-surface temperature variations to anthropogenic and natural causes using CMIP5 simulations. J. Geophys. Res. Atmos., 118, 4001–4024, https://doi.org/10.1002/jgrd.50239.
  • Jones, P., 2016: The reliability of global and hemispheric surface temperature records. Adv. Atmos. Sci., 33, 269–282, https://doi.org/10.1007/s00376-015-5194-4.
  • Mauritsen, T., and Coauthors, 2019: Developments in the MPI-M Earth system model version 1.2 (MPI-ESM1.2) and its response to increasing CO2. J. Adv. Model. Earth Syst., 11, 998–1038, https://doi.org/10.1029/2018MS001400.
  • McKinnon, K. A., and C. Deser, 2018: Internal variability and regional climate trends in an observational large ensemble. J. Climate, 31, 6783–6802, https://doi.org/10.1175/JCLI-D-17-0901.1.
  • Miller, R. L., 1997: Tropical thermostats and low cloud cover. J. Climate, 10, 409–440, https://doi.org/10.1175/1520-0442(1997)010<0409:TTALCC>2.0.CO;2.
  • Morice, C. P., and Coauthors, 2021: An updated assessment of near-surface temperature change from 1850: The HadCRUT5 data set. J. Geophys. Res. Atmos., 126, e2019JD032361, https://doi.org/10.1029/2019JD032361.
  • Mulcahy, J. P., and Coauthors, 2018: Improved aerosol processes and effective radiative forcing in HadGEM3 and UKESM1. J. Adv. Model. Earth Syst., 10, 2786–2805, https://doi.org/10.1029/2018MS001464.
  • Mulcahy, J. P., and Coauthors, 2023: UKESM1.1: Development and evaluation of an updated configuration of the UK Earth System Model. Geosci. Model Dev., https://doi.org/10.5194/gmd-2022-113, in press.
  • Olonscheck, D., and D. Notz, 2017: Consistently estimating internal climate variability from climate model simulations. J. Climate, 30, 9555–9573, https://doi.org/10.1175/JCLI-D-16-0428.1.
  • Olonscheck, D., M. Rugenstein, and J. Marotzke, 2020: Broad consistency between observed and simulated trends in sea surface temperature patterns. Geophys. Res. Lett., 47, e2019GL086773, https://doi.org/10.1029/2019GL086773.
  • Orbe, C., and Coauthors, 2020: Representation of modes of variability in six U.S. climate models. J. Climate, 33, 7591–7617, https://doi.org/10.1175/JCLI-D-19-0956.1.
  • PAGES 2k Consortium, 2013: Continental-scale temperature variability during the past two millennia. Nat. Geosci., 6, 339–346, https://doi.org/10.1038/ngeo1797.
  • PAGES 2k Consortium, 2019: Consistent multidecadal variability in global temperature reconstructions and simulations over the Common Era. Nat. Geosci., 12, 643–649, https://doi.org/10.1038/s41561-019-0400-0.
  • Parsons, L. A., M. K. Brennan, R. C. J. Wills, and C. Proistosescu, 2020: Magnitudes and spatial patterns of interdecadal temperature variability in CMIP6. Geophys. Res. Lett., 47, e2019GL086588, https://doi.org/10.1029/2019GL086588.
  • Paulot, F., D. Paynter, P. Ginoux, V. Naik, and L. W. Horowitz, 2018: Changes in the aerosol direct radiative forcing from 2001 to 2015: Observational constraints and regional mechanisms. Atmos. Chem. Phys., 18, 13 265–13 281, https://doi.org/10.5194/acp-18-13265-2018.
  • Planton, Y. Y., and Coauthors, 2021: Evaluating climate models with the CLIVAR 2020 ENSO metrics package. Bull. Amer. Meteor. Soc., 102, E193–E217, https://doi.org/10.1175/BAMS-D-19-0337.1.
  • Reichler, T., and J. Kim, 2008: How well do coupled models simulate today’s climate? Bull. Amer. Meteor. Soc., 89, 303–312, https://doi.org/10.1175/BAMS-89-3-303.
  • Richardson, M., K. Cowtan, E. Hawkins, and M. B. Stolpe, 2016: Reconciled climate response estimates from climate models and the energy budget of Earth. Nat. Climate Change, 6, 931–935, https://doi.org/10.1038/nclimate3066.
  • Rieger, L. A., J. N. S. Cole, J. C. Fyfe, S. Po-Chedley, P. J. Cameron-Smith, P. J. Durack, N. P. Gillett, and Q. Tang, 2020: Quantifying CanESM5 and EAMv1 sensitivities to Mt. Pinatubo volcanic forcing for the CMIP6 historical experiment. Geosci. Model Dev., 13, 4831–4843, https://doi.org/10.5194/gmd-13-4831-2020.
  • Sanderson, B. M., R. Knutti, and P. Caldwell, 2015: A representative democracy to reduce interdependency in a multimodel ensemble. J. Climate, 28, 5171–5194, https://doi.org/10.1175/JCLI-D-14-00362.1.
  • Seager, R., M. Cane, N. Henderson, D.-E. Lee, R. Abernathey, and H. Zhang, 2019: Strengthening tropical Pacific zonal sea surface temperature gradient consistent with rising greenhouse gases. Nat. Climate Change, 9, 517–522, https://doi.org/10.1038/s41558-019-0505-x.
  • Sen Gupta, A., N. C. Jourdain, J. N. Brown, and D. Monselesan, 2013: Climate drift in the CMIP5 models. J. Climate, 26, 8597–8615, https://doi.org/10.1175/JCLI-D-12-00521.1.
  • Smith, C. J., and Coauthors, 2021: Energy budget constraints on the time history of aerosol forcing and climate sensitivity. J. Geophys. Res. Atmos., 126, e2020JD033622, https://doi.org/10.1029/2020JD033622.
  • Suarez-Gutierrez, L., S. Milinski, and N. Maher, 2021: Exploiting large ensembles for a better yet simpler climate model evaluation. Climate Dyn., 57, 2557–2580, https://doi.org/10.1007/s00382-021-05821-w.
  • Taylor, K. E., D. Williamson, and F. Zwiers, 2000: The sea surface temperature and sea ice concentration boundary conditions for AMIP II simulations. PCMDI Rep. 60, 28 pp., https://pcmdi.llnl.gov/report/pdf/60.pdf.
  • Wang, C., B. J. Soden, W. Yang, and G. A. Vecchi, 2021: Compensation between cloud feedback and aerosol-cloud interaction in CMIP6 models. Geophys. Res. Lett., 48, e2020GL091024, https://doi.org/10.1029/2020GL091024.
  • Watanabe, M., J.-L. Dufresne, Y. Kosaka, T. Mauritsen, and H. Tatebe, 2021: Enhanced warming constrained by past trends in equatorial Pacific sea surface temperature gradient. Nat. Climate Change, 11, 33–37, https://doi.org/10.1038/s41558-020-00933-3.
  • Wills, R. C. J., Y. Dong, C. Proistosescu, K. C. Armour, and D. S. Battisti, 2022: Systematic climate model biases in the large-scale patterns of recent sea-surface temperature and sea-level pressure change. Geophys. Res. Lett., 49, e2022GL100011, https://doi.org/10.1029/2022GL100011.
  • Zhang, J., and Coauthors, 2021: The role of anthropogenic aerosols in the anomalous cooling from 1960 to 1990 in the CMIP6 Earth system models. Atmos. Chem. Phys., 21, 18 609–18 627, https://doi.org/10.5194/acp-21-18609-2021.
  • Zinke, J., S. A. Browning, A. Hoell, and I. D. Goodwin, 2021: The west Pacific gradient tracks ENSO and zonal Pacific sea surface temperature gradient during the last millennium. Sci. Rep., 11, 20395, https://doi.org/10.1038/s41598-021-99738-3.
  • Fig. 1. Total observational error (Eo) of the global-mean metric. The gray lines show the residuals of individual realizations of the HadCRUT5 global-mean analysis, including a randomly generated contribution that accounts for the coverage error. The black lines are the bounds of the 95% confidence interval.

  • Fig. 2. Graphical example of the calculation of the number of exceedances for a given pair of segments of the piControl simulations. This example is for GMSAT, but the method is the same for all indices. The blue line shows the difference between the two piControl segments, which provides a sample of UM − UO. The red line is the absolute value of the 10-yr running mean of the blue line. The green line marks the exceedance threshold, 0.1 K in this example. The number of exceedances is the number of red points above the green line.
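The bookkeeping described in the Fig. 2 caption could be implemented as in the minimal Python sketch below; this is an illustration under the assumption of a simple 10-yr boxcar running mean, and the synthetic segments are placeholders, not data from this study.

```python
# Minimal sketch (assumption): count how many points of the 10-yr running mean of the
# difference between two piControl segments exceed a given threshold in absolute value.
import numpy as np

def count_exceedances(segment_a, segment_b, threshold=0.1, window=10):
    """Number of running-mean points whose absolute value exceeds the threshold (K)."""
    diff = np.asarray(segment_a, dtype=float) - np.asarray(segment_b, dtype=float)
    running_mean = np.convolve(diff, np.ones(window) / window, mode="valid")
    return int(np.sum(np.abs(running_mean) > threshold))

# Example with two synthetic 165-yr GMSAT segments (placeholder data only).
rng = np.random.default_rng(2)
seg1 = rng.normal(scale=0.15, size=165)
seg2 = rng.normal(scale=0.15, size=165)
print(count_exceedances(seg1, seg2, threshold=0.1))
```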

  • Fig. 3. Empirical quantile distribution functions QZ(p; T, y, Nm