1. Introduction
The historical record of near-surface air temperature (SAT) is widely used as a performance metric for climate models (e.g., Braganza et al. 2003; Reichler and Kim 2008). The time series of annual-mean anomalies is a benchmark against which models are tested, and it has been used to assess the credibility of a model’s ability to provide information on future changes (e.g., Brunner et al. 2020). Recent research suggests that the later part of the historical period (1980 onward) contains information about the sensitivity of Earth’s climate to external forcing (Flynn and Mauritsen 2020; Dittus et al. 2020), although this relationship may not be as strong as suggested because of common model biases in the simulation of historical sea surface temperature (SST) patterns (Andrews and Webb 2018; Ceppi and Gregory 2017), the sensitivity to biomass burning aerosols (Fasullo et al. 2022), or a non-negligible contribution of internal variability to multidecadal trends (McKinnon and Deser 2018). The tropical SST patterns are strongly connected to regional precipitation anomalies, which is relevant for the accurate simulation of drought-inducing teleconnections (e.g., Annamalai et al. 2013; Zinke et al. 2021). The radiative forcing over the historical record is also uncertain, mainly because of the role of aerosols (e.g., Smith et al. 2021), with important implications for the historical warming shown by models (e.g., Wang et al. 2021; Zhang et al. 2021). Potentially, all this information can be used to improve a model’s response to external forcing, subject to the constraints of process observations. However, there is no common approach on how to incorporate the historical record into model development.
Several modeling centers have directly “calibrated” or “tuned” historical simulations (i.e., adjusted them to improve the realism of the simulated climate change) during the development of the models used for phase 6 of the Coupled Model Intercomparison Project (CMIP6; Eyring et al. 2016), while others did not use historical simulations during development. For example, the historical warming of the MPI-ESM1.2-LR model was tuned by reducing its climate sensitivity during its development (Mauritsen et al. 2019). Boucher et al. (2020) describe the developments and performance of the IPSL‐CM6A‐LR model. Although historical simulations were not used as part of the development, the r1i1p1f1 simulation was selected qualitatively among the first ∼12 available historical members, based on a few key observables of the historical period. During the development of the Energy Exascale Earth System Model, version 1 (E3SMv1), a historical simulation was performed with a near-final version of the model, but no action was taken to change the historical performance in the final version (Golaz et al. 2019).
The use of historical runs (or any coupled run with transient forcing) for tuning is not part of the Met Office Unified Model (UM) development protocol. The Hadley Centre models submitted to CMIP6 were not tuned to the historical record, although several model improvements were added to ensure that the total present-day radiative forcing was positive (Mulcahy et al. 2018). This approach was revised in the 2020 UM Users Workshop, where it was agreed that one of the key model errors was the simulation of the historical record. As a result, a Prioritised Evaluation Group (PEG) was created with the objective of improving the simulation of the historical global-mean surface temperature record. Also, in a recent review of the UM’s Global Configuration (GC) development protocol, it was agreed that a small ensemble of historical simulations will be run during the final stage of the development cycle, opening the option to implement model changes that target the performance of the simulation of the historical record before the final configuration is delivered to the users.

In this paper we present the first step toward incorporating historical information into the UM’s development process. We develop a statistical method to test whether simulations of large-scale surface temperature change are realistic during the historical period (1850–2014). The method is applied to annual-mean time series of three surface temperature indices: global mean, hemispheric gradient, and a recently developed index that captures the SST pattern in the tropics (SST#; Fueglistaler and Silvers 2021). We test the historical simulations of the CMIP6 ensemble and post-CMIP6 versions of the HadGEM3 and UKESM models. We use the term “realistic” in a relative manner: a model that performs well against the tests described here may do so because of compensating errors (e.g., between forcings and feedbacks). Consequently, models that we label as realistic in the present study could nonetheless be rejected once other metrics with additional observational evidence or process understanding are considered. This shortcoming is not specific to this methodology, and the method we propose here should be used alongside a wide range of diagnostics to provide a detailed assessment.

The structure of the paper is as follows. Section 2 describes the observational and model data. The statistical methodology is detailed in section 3, and section 4 presents the results of the method applied to the CMIP6 historical ensemble. Finally, section 5 discusses the results and conclusions.
2. Model data and observations
We use near-surface air temperature (CMIP variable “tas”) data from the piControl and historical experiments of the CMIP6 archive, which are atmosphere–ocean coupled simulations. The piControl experiments are unforced simulations with forcing agents fixed at preindustrial (year 1850) levels. After a spinup period, the CMIP6 protocol requests a minimum of 500 simulation years, but not all models fulfil this criterion. We explain how we deal with different lengths of the piControl time series in the next section.
The CMIP6 protocol (Eyring et al. 2016) recommended that the historical experiments be run with the current best estimates of the time-evolving datasets of forcing agents: atmospheric composition, solar irradiance, natural and anthropogenic aerosols, and land-use change, although not all institutions followed the protocol. The historical experiments branch from the piControl simulation and run from 1850 to 2014 (165 years). The CMIP6 protocol recommends running at least three historical simulations, branching from different points in the piControl simulations. We use 40 piControl simulations from the CMIP6 ensemble, plus simulations from GC4.0-LL and UKESM1.1-LL (Mulcahy et al. 2023), models developed after CMIP6.
We use three different observational datasets of surface temperature: the Met Office Hadley Centre/Climatic Research Unit global surface temperature dataset, version 5 (HadCRUT5.0.1.0; Morice et al. 2021), the Program for Climate Model Diagnosis and Intercomparison (PCMDI) SST reconstruction (Hurrell et al. 2008; Taylor et al. 2000), and the Extended Reconstructed Sea Surface Temperatures, version 5 (ERSSTv5; Huang et al. 2017a,b). The baseline period used for all historical datasets is 1880–1919.
HadCRUT5 provides temperature anomalies on a latitude–longitude rectangular grid. Two variants of the same dataset are provided: a noninfilled version, with data only in grid boxes where measurements are available, and a more spatially complete version. For global and regional time series, the HadCRUT5 analysis error model contains two terms (Morice et al. 2021): the analysis error (εa) and the coverage error (εc). The analysis error combines the errors from the Gaussian process used in the statistical infilling and the instrumental errors. The analysis grids are not generally globally complete, particularly in the early observed record. Regions are omitted where there are insufficient data available to form reliable gridcell estimates. The coverage error represents the uncertainty in spatial averages arising from these unrepresented regions. The analysis error is represented by the 200 realizations of the historical record, whereas the coverage error is reported as a time series of standard deviations. We use the more spatially complete version, also termed the “HadCRUT5 analysis.” The HadCRUT5 analysis dataset uses a statistical method to extend temperature anomaly estimates into regions for which the underlying measurements are informative. This makes it more suitable for comparisons of large-scale regional average diagnostics against spatially complete model data, although variability in “infilled” regions will be lower than where observed measurement data are present (Jones 2016). We use the HadCRUT5 analysis as a reference dataset for two of the indices: global mean and hemispheric gradient. We use the global means calculated by averaging the hemispheric means, as recommended by Morice et al. (2021).
The SST# index is defined as the difference between the average of the warmest 30% of SSTs (actual values, not anomalies) and the domain average. The domain used for this particular metric is the tropics, from 30°S to 30°N. This index represents the difference in SSTs between the convective regions and the tropical average, and it explains the anomalies in low cloud cover (and cloud radiative feedbacks) over the historical record due to changes in SST patterns (Fueglistaler and Silvers 2021). The index is calculated using monthly mean SSTs, and annual averages are then calculated. The same process is followed for both models and observations. Since this index cannot be calculated from local anomalies, a dataset that provides absolute temperature estimates is required. The PCMDI dataset provides monthly mean sea surface temperature and sea ice concentration data from 1870 to the present on a regular latitude–longitude grid. These data are designed to be used as boundary conditions for atmosphere-only simulations. They use the AMIPII midmonth calculation (Taylor et al. 2000), which ensures that the monthly mean of the time-interpolated data is identical to the input monthly mean. Following the convention in other studies, we refer to this dataset as PCMDI/AMIPII. SST# is subject to a large observational uncertainty (Fueglistaler and Silvers 2021), attributed to the different methodologies used to provide information where observations are not available. Given that the PCMDI/AMIPII dataset does not provide a comprehensive error characterization, we use ERSSTv5 to test the robustness of our results to the observational uncertainty in SST#. We have chosen the PCMDI/AMIPII and ERSSTv5 datasets because they fall at opposite ends of the spectrum of SST# anomalies provided by observational datasets, spanning the range of structural uncertainties in the observational reconstructions of SST#. There is evidence of differences between near-surface air temperature and surface temperature diagnostics (e.g., Richardson et al. 2016). The Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6; Gulev et al. 2021) assesses that the difference between these diagnostics in long-term global-mean trends is at most 10% in either direction, with low confidence in the sign of any difference. Jones (2020) supports the use of global near-surface air temperature model diagnostics with blended datasets of observed temperature changes.
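To make the definition concrete, the following is a minimal Python sketch of how SST# could be computed for one monthly SST field. The function name is ours, and two details are assumptions rather than taken from Fueglistaler and Silvers (2021): the warmest 30% is defined by area (not by number of grid points), and land and sea ice points are assumed to be masked with NaN.

```python
import numpy as np

def sst_sharp(sst, lat):
    """One monthly sample of SST# (K): area-weighted mean of the warmest 30%
    of tropical ocean points minus the tropical-domain mean.

    sst : 2D array (lat, lon) of monthly-mean SST in K, NaN over land/sea ice.
    lat : 1D array of grid latitudes in degrees.
    """
    tropics = np.abs(lat) <= 30.0                       # 30S-30N domain
    t = sst[tropics]
    w = np.cos(np.deg2rad(lat[tropics]))[:, None] * np.ones_like(t)

    valid = np.isfinite(t)
    t, w = t[valid], w[valid]

    domain_mean = np.sum(w * t) / np.sum(w)

    # Warmest 30% by area: sort points by temperature (descending) and keep
    # them until 30% of the total ocean area in the domain is covered.
    order = np.argsort(t)[::-1]
    t_sorted, w_sorted = t[order], w[order]
    covered = np.cumsum(w_sorted) / np.sum(w_sorted)
    warm = covered <= 0.30
    warm_mean = np.sum(w_sorted[warm] * t_sorted[warm]) / np.sum(w_sorted[warm])

    return warm_mean - domain_mean
```

In practice this would be applied to every monthly mean field and the monthly values averaged to annual means, identically for models and observations, as described above.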
3. Methodology
Let $H_O(t)$ be the time series of the observed historical record anomalies of any given surface temperature index. We decompose it as $H_O(t) = S(t) + U_O(t) + E_O(t)$, where $S(t)$ represents the forced signal, $U_O(t)$ is the unforced variability, and $E_O(t)$ is the total observational error. Similarly, for a given model we decompose any historical simulation of the same index as $H_M(t) = S(t) + D_M(t) + U_M(t)$, where $D_M(t)$ represents a discrepancy term or error in the forced response, and $U_M(t)$ is the model’s unforced variability.
If we hypothesize that the model’s forced response is realistic [i.e., $D_M(t) = 0$], then $H_M(t) - H_O(t) = U_M(t) - U_O(t) - E_O(t)$. We can test this hypothesis by comparing $H_M(t) - H_O(t)$ with the expected distribution of $U_M(t) - U_O(t) - E_O(t)$. In general, we have more than one realization of a model’s historical experiment, each of them with a different realization of the model unforced variability. Since we only have a single sample of the real world’s unforced variability, tests on individual ensemble members are not independent. We avoid this problem by formulating the test for ensemble means, noting that $S(t)$ [and $D_M(t)$] are the same for each ensemble member:

$$\overline{H}_M(t) - H_O(t) = \overline{U}_M(t) - U_O(t) - E_O(t),$$

where the overbar denotes the average over the members of the historical ensemble.
The problem is now reduced to the characterization of the distribution of the right-hand side of the equation. Ideally, $U_O$ should be characterized from a long time series of the real system under no external forcing. Paleoclimatic proxy reconstructions are available only for restricted regions, and are therefore not representative of the large spatial scales of interest for this study, as well as having larger errors. They have the additional complication that the external forcing is not zero during the paleoclimate record. Therefore, we instead assume that unforced simulations of the multimodel ensemble provide us with a reasonable estimate of the real world’s unforced variability, an approach that has been used in other studies (e.g., Gillett et al. 2002). Hence, we characterize the distribution of $\overline{U}_M(t) - U_O(t) - E_O(t)$ using segments of piControl simulations from the multimodel ensemble, combined with samples of the observational error, as described in the following subsections.
The subsections below describe the next steps in the methodology: calculation of the observational error term; construction of the unforced distribution of differences; definition of the exceedance metric; and testing of ensembles of historical simulations.
a. Calculation of the observational error
For the HadCRUT5 observations, we combine analysis and coverage errors into a single error term ($E_O$) as follows. We add samples of a normally distributed random variable of zero mean and variance $\mathrm{Var}[\varepsilon_c(t)]$ to the residuals of the 200 realizations of the HadCRUT5 analysis. The total error inherits the autocorrelation characteristics of the analysis error, which is correlated in time. The error term $E_O$ is then modeled by drawing random samples from this 200-member ensemble of realizations. The time dependence of $E_O$ for the global mean is shown in Fig. 1. The black lines show the 95% confidence interval [comparable to the orange range in Fig. 2 of Morice et al. (2021)]. In general, the observational error decreases with time, apart from periods of international conflict. The time dependence of $E_O$ for the hemispheric difference is very similar to that of the global mean, but larger in magnitude.
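The construction of $E_O$ can be sketched as follows in Python. This is a minimal illustration, assuming `analysis_ensemble` holds the 200 HadCRUT5 analysis realizations of the index (members × years) and `coverage_std` the reported coverage-error standard deviations; the function name is ours.

```python
import numpy as np

def sample_total_error(analysis_ensemble, coverage_std, n_samples, seed=0):
    """Draw realizations of the total observational error E_O(t).

    analysis_ensemble : (200, n_years) array of HadCRUT5 analysis realizations
                        of the index (global mean or hemispheric difference).
    coverage_std      : (n_years,) reported coverage-error standard deviation.
    """
    rng = np.random.default_rng(seed)

    # Residuals about the ensemble mean carry the time-correlated analysis error.
    residuals = analysis_ensemble - analysis_ensemble.mean(axis=0)

    # Add white-noise coverage error with the reported standard deviation.
    total = residuals + rng.normal(0.0, coverage_std, size=residuals.shape)

    # E_O is then modeled by resampling from this augmented ensemble.
    idx = rng.integers(0, total.shape[0], size=n_samples)
    return total[idx]
```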
For the SST# index, we do not include an error term due to lack of error information in the observational datasets. However, we repeat the analysis with two different observational datasets to test the robustness of the results.
b. Construction of the unforced distribution of differences
Here we are concerned with the generation of random samples of the unforced terms $\overline{U}_M(t) - U_O(t)$. Each piControl time series is first detrended with a linear fit to remove any residual model drift (e.g., Sen Gupta et al. 2013), giving the detrended control time series X(t).
We split the detrended control time series X(t) into nonoverlapping segments 165 years long, equal to the length of the CMIP6 historical simulations. The piControl simulations differ in length between models, so to give (nearly) equal weight to each model we use up to 3 segments of each piControl simulation, while also retaining models whose control runs are too short to provide 3 segments. With these constraints, we use 41 piControl simulations, 32 of them with 3 segments, 5 with 2 segments, and 4 with only 1 segment. This gives 110 piControl segments of equal length. We then subtract the time average of each segment, so that the mean value of each segment is zero by construction. We refer to these detrended, 165-yr-long, zero-mean piControl samples of the unforced variability as $U_\mathrm{piControl}(t)$, and we use them to generate samples of $\overline{U}_M(t) - U_O(t)$.
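A minimal Python sketch of the segment construction is given below, assuming `index_annual` holds the annual means of the temperature index from one model’s piControl run; the constants and function name are illustrative.

```python
import numpy as np

SEGMENT_LEN = 165   # length of a CMIP6 historical simulation, in years
MAX_SEGMENTS = 3    # at most 3 segments per model

def picontrol_segments(index_annual):
    """Detrended, zero-mean, 165-yr segments from one model's piControl run.

    index_annual : 1D array of annual means of the temperature index.
    """
    years = np.arange(index_annual.size)

    # Remove residual drift with a linear fit (quadratic detrending gives
    # very similar results; see the sensitivity test later in this section).
    trend = np.polyval(np.polyfit(years, index_annual, deg=1), years)
    detrended = index_annual - trend

    n_seg = min(detrended.size // SEGMENT_LEN, MAX_SEGMENTS)
    segments = []
    for i in range(n_seg):
        seg = detrended[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        segments.append(seg - seg.mean())   # zero time-mean by construction
    return segments
```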
Other approaches for estimating internal variability exist, and a recent study by Olonscheck and Notz (2017) provides a brief description of the two main avenues and their caveats. We have used a method that is based on piControl simulations, which may be unsuitable if the unforced variability is state dependent. However, Olonscheck and Notz (2017) show that the variability remains largely unchanged for historical simulations, even for those variables like sea ice area that show large changes in simulations of future warming. Therefore, we assume that the variability remains unchanged for the temperature indices used here and the amount of climate change in the historical period.
c. Definition of the metric: Number of exceedances
Our interest is to characterize the quality of a historical ensemble of simulations against observations. As a metric of quality, in section 3d we compute the number of exceedances of the difference between the historical ensemble mean and the observations, and compare it with the distribution of exceedances expected from unforced variability and observational error alone, which we construct here.
The samples of $\overline{U}_M(t) - U_O(t) - E_O(t)$ are constructed as follows: we average Nm randomly drawn segments of $U_\mathrm{piControl}(t)$ to represent the unforced variability of the ensemble mean of a historical ensemble with Nm members, subtract one further randomly drawn segment to represent the real world’s unforced variability, and subtract one randomly drawn realization of the observational error $E_O(t)$.
We define E(T, y, Nm) as the number of exceedances above a threshold T (in K) of a filtered time series of the absolute values of a sample of $\overline{U}_M(t) - U_O(t) - E_O(t)$, where the filter is a running mean of length y years and Nm is the size of the historical ensemble being tested.
Figure 2 presents an example of this process for the global-mean SAT (GMSAT), leading to the calculation of one sample of E(0.1, 10, 5). The blue line shows one sample of $\overline{U}_M(t) - U_O(t) - E_O(t)$ for an ensemble of 5 members; the number of years for which the 10-yr running mean of its absolute value exceeds the 0.1-K threshold gives one sample of E(0.1, 10, 5).
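The sampling of E(T, y, Nm) could be sketched in Python as below. This is an illustration under our stated assumptions about the construction (Nm + 1 randomly drawn segments plus one error realization, with the running mean applied to the absolute differences); all names, and the handling of the end points of the running mean, are ours.

```python
import numpy as np

def running_mean(x, y):
    """Centred y-yr running mean (valid part of the convolution only)."""
    return np.convolve(x, np.ones(y) / y, mode="valid")

def sample_exceedances(segments, error_samples, T, y, n_members, rng):
    """One control-distribution sample of E(T, y, Nm).

    segments      : list of 165-yr, zero-mean U_piControl segments (all models).
    error_samples : (n_err, 165) array of observational-error realizations.
    """
    # Ensemble-mean model variability: average of Nm randomly drawn segments.
    idx = rng.integers(0, len(segments), size=n_members)
    u_m_bar = np.mean([segments[i] for i in idx], axis=0)

    # Real-world unforced variability: one further randomly drawn segment.
    u_o = segments[rng.integers(0, len(segments))]

    # Observational error: one randomly drawn realization.
    e_o = error_samples[rng.integers(0, error_samples.shape[0])]

    diff = u_m_bar - u_o - e_o

    # Count the years in which the filtered absolute difference exceeds T.
    return int(np.sum(running_mean(np.abs(diff), y) > T))
```

Repeating this many times at every point of the (T, y) grid yields the empirical quantile functions described next.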
We construct a second metric following the same steps, but using variance-scaled samples, in which the piControl segments are rescaled so that their variance reflects the variance of the piControl simulation of the model being tested rather than the multimodel mean variance. We denote the resulting number of exceedances Es(T, y, Nm).
From these sets of samples of E(T, y, Nm) and Es(T, y, Nm), we construct empirical quantile distribution functions QZ(p; T, y, Nm), which give the number of exceedances for a given cumulative probability p. Z is a generic discrete random variable name that refers to either E or Es. For simplicity, from now on we omit the dependency on the ensemble size Nm.
In summary, for each historical ensemble, we have calculated two empirical quantile functions (one with variance scaling and one without) at each point of the (T, y) grid. Figure 3 shows examples of QE for a historical ensemble of 3 members. For a given T and y, the probability is p that the number of exceedances (occurring during a 165-yr historical integration) will be less than QZ(p; T, y). There is zero chance that the number of exceedances will be less than zero, a small chance that it will be less than a small number, and we are certain that it will be less than a sufficiently large number (at most 165). Thus, QZ increases with p (Figs. 3a,b). For any given p, the expected number of exceedances QZ is smaller for a longer averaging period y (Fig. 3a) or a higher threshold T (Fig. 3b).
d. Testing ensembles of historical simulations
We test each historical ensemble by comparing the number of exceedances of the difference between the ensemble mean and the observations against the number of exceedances expected from the control distribution. First, we calculate Eh(T, y), the number of exceedances above the threshold T of the y-yr running mean of the absolute value of $\overline{H}_M(t) - H_O(t)$. We then compare Eh(T, y) with the empirical quantile function: in the upper-tail test, the null hypothesis of a realistic forced response is rejected at significance level α wherever Eh(T, y) > QZ(p = 1 − α; T, y).
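A sketch of the upper-tail test over the (T, y) grid is given below, assuming the control samples of E(T, y, Nm) have already been generated (e.g., with the sampling sketch above); the data structures and names are illustrative, not a definitive implementation.

```python
import numpy as np

def exceedances(diff, T, y):
    """Exceedances of the y-yr running mean of |diff| above threshold T."""
    smooth = np.convolve(np.abs(diff), np.ones(y) / y, mode="valid")
    return int(np.sum(smooth > T))

def upper_tail_test(hist_ensemble, obs, control_samples, thresholds, windows,
                    alpha=0.05):
    """Flag the (T, y) points where the null hypothesis is rejected.

    hist_ensemble   : (Nm, 165) array of historical simulations of the index.
    obs             : (165,) observed index on the same baseline.
    control_samples : dict mapping (T, y) -> 1D array of E(T, y, Nm) samples.
    """
    diff = hist_ensemble.mean(axis=0) - obs          # ensemble mean minus obs
    reject = np.zeros((len(thresholds), len(windows)), dtype=bool)
    for i, T in enumerate(thresholds):
        for j, y in enumerate(windows):
            e_h = exceedances(diff, T, y)
            q = np.quantile(control_samples[(T, y)], 1.0 - alpha)
            reject[i, j] = e_h > q                   # upper-tail rejection
    return reject
```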
Figure 4 shows an example of the upper-tail test applied to the entire (T, y) grid, using a significance level α = 0.05. For illustrative purposes, it is helpful to choose a model like EC-Earth3-Veg, which has large multidecadal unforced variability (Parsons et al. 2020). The filled contours in Figs. 4a and 4c show QZ(p = 0.95; T, y) for Z = E and Z = Es, respectively. The shape of QZ is very similar for all models and ensemble sizes. As also shown in Fig. 3, QZ gets smaller as T gets larger for a given y (it is less likely to exceed a higher threshold), and smaller as y gets larger for a given T (a longer time mean is less likely to exceed a threshold), although the dependency on y is much weaker.
The dotted regions in the (T, y) grid mark where the null hypothesis is rejected (Eh > QZ). In this example, the test without variance scaling (Fig. 4a) shows many rejections, whereas the variance-scaled test (Fig. 4c) shows none. This contrast implies that the unforced variance of EC-Earth3-Veg is larger than the multimodel mean variance. The large variance increases the number of exceedances in the test without variance scaling, whereas variance scaling raises the control surface QZ(p = 0.95; T, y), making it easier for the model to pass the test. Conversely, the scaling penalizes models that pass the nonscaled test only because their unforced variability is very small compared to the multimodel mean variance, which we assume to be the best estimate of the real world’s unforced variability.
How much of the (T, y) space must fail the statistical test for the model as a whole to be deemed “incompatible”? We have divided the (T, y) grid into 29 × 10 points, so we would expect a good model to fail at 29 × 10 × 0.05 ≈ 15 points in the (T, y) grid just by chance if there were no correlation in the number of exceedances between (T, y) neighbors. Because the numbers of exceedances at neighboring time scales and thresholds are correlated, incompatibility, where it occurs, is likely to cover patches of adjacent points in the (T, y) grid. Given that our main aim is to apply this method to intercompare models, we do not define a single, strict threshold for labeling a model as incompatible with the observations. Instead, we use the following guidance: models with fewer than 10 failures (dots) pass the test; models that fail between 10 and 20 times are considered marginal; models with more than 20 failures are labeled as incompatible.
The lower-tail test [Eh(T, y) < QZ(p = 0.05; T, y)] can be presented in a similar way, but only one of the models tested fails this test (FGOALS-g3, and only marginally). A model fails this test if its historical simulation deviates less than expected from reality, which can happen only if it has both a realistic forced response and unrealistically small unforced variability. It could be that the lower-tail test rarely fails because models in general do not have a realistic forced response. For the remainder of the paper, we discuss the results of the upper-tail test only.
We have tested the sensitivity of the results to the order of the polynomial used for the detrending of the piControl time series. The results are largely insensitive to the use of quadratic instead of linear detrending, so we conclude that our method is robust with respect to the detrending method. If this test is applied to metrics that require nonlinear detrending we would recommend the use of more flexible methods with better properties (e.g., splines).
4. Results and discussion
In this section we present results for three temperature indices: global mean, hemispheric difference, and SST#. These three metrics capture important complementary information about key aspects of temperature change over the historical record. The global mean has been widely used as the most fundamental metric of climate change. The hemispheric difference captures the influence of anthropogenic aerosols during the historical period, as emissions are dominated by sources in the Northern Hemisphere, and it is reasonably independent of the global mean (Braganza et al. 2003). The changes in tropical SST pattern control the sign and strength of low cloud feedbacks in response to CO2 forcing (e.g., Miller 1997; Gregory and Andrews 2016), making it an important metric of the historical record.
a. Global mean
Figure 5 shows the tests without variance scaling. Out of the 40 models analyzed, 20 can be labeled as incompatible with the observed record, according to this test. These are the models that show large dotted areas. The other 20 models do not fail the test at all or fail only in a few instances. Models tend to fail the test for large exceedance thresholds T, with little dependence on the length of the averaging window y; that is, they tend to fail along entire “columns” in the contour plot.
When the variance-scaled test is applied (Fig. 6), 22 models are labeled as incompatible with the observed record, and 18 models pass the test. No models are in the marginal category. The variance-scaled test rejects 5 additional models, and labels as compatible 3 models that were rejected by the test without variance scaling. This is because these models have a piControl variance that is very different to the multimodel mean variance.
We have presented an example of a model with large unforced variability in Fig. 4. Figure 7 shows an example of a model with small unforced variability: MRI-ESM2-0. The control surface of the number of exceedances is lowered by the variance scaling, making it easier for the model to fail the test. Since we do not make any assumption about the quality of individual models’ piControl simulations, the variance scaling is an attempt to enable a fair comparison between models with different unforced variability.
These two examples show how the characteristics of each model’s unforced variability are incorporated into the test. This is particularly helpful when the ensemble size of historical simulations is small, which makes it difficult to assess the impact of the unforced variability by visual inspection. It must be emphasized that we treat all piControl simulations as equally plausible, but the method could be refined by bringing in external information to better characterize the unforced variability of the real system. We expand on this below when we discuss the caveats of the methodology.
b. Hemispheric gradient
Figure 8 shows the tests without variance scaling for the hemispheric gradient index. Out of the 40 models analyzed, 8 are labeled as incompatible with the observed record, 1 is marginal, and 31 pass the test. If variance scaling is used (Fig. 9), the results are very similar, with 7 models rejected, 3 marginal, and 29 passing the test. As with the global mean, failures tend to happen along “columns,” that is, for all averaging window lengths. It is interesting to note that, contrary to the global mean, the hemispheric gradient shows more failures for small exceedance thresholds.
In CESM2 there is a strong sensitivity of the hemispheric gradient to the variability in biomass burning emissions from 40° to 70°N, which leads to spurious warming in the late historical period (Fasullo et al. 2022). However, this model passes the global and hemispheric tests, which may suggest the presence of compensating biases. This highlights the importance of having a large battery of diagnostics capable of assessing model performance from different angles.
c. SST#
Figures 10 and 11 show the multimodel ensemble results for SST#, without and with scaling of the unforced variance, respectively. The test without variance scaling rejects all CMIP6 models. When variance scaling is used, only one model, GISS-E2-1-G, is not rejected. The GISS models are examples of models with large unforced variability (Fig. 12). Unlike in previous examples with large unforced variability on long time scales (Fig. 4), the unforced variability of the GISS models is dominated by high-frequency (annual) variability. Given that the observational record does not show such large high-frequency variability, we conclude that the test without variance scaling is probably a better assessment of the performance of the GISS models. This conclusion is also supported by Orbe et al. (2020), who show that GISS-E2-1-G is an outlier in the simulation of ENSO.
SST# is subject to a large observational uncertainty (Fueglistaler and Silvers 2021). The observations show very good agreement during the satellite era (1979 onward), when the spatial coverage is very dense, but they show large discrepancies before satellite data were available. The differences are attributed to the different methodologies used to provide information where observations are not available. Given that the PCMDI/AMIPII dataset does not provide a comprehensive error characterization, we have repeated the tests using the ERSSTv5 dataset to test the robustness of our conclusions. We have chosen the PCMDI/AMIPII and ERSSTv5 datasets because they fall at opposite ends of the spectrum of SST# anomalies provided by observational datasets, giving us information about structural uncertainties in the observational reconstructions of SST#. The results with ERSSTv5 (not shown) are similar to the comparisons against PCMDI/AMIPII: all the CMIP6 models are rejected by both tests, with and without variance scaling. This confirms that the results are robust with respect to observational uncertainty in SST#.
The fact that the entire CMIP6 ensemble performs poorly in the SST# index is consistent with previous studies showing that models in general do not reproduce the Pacific SST trends of recent decades (Seager et al. 2019; Gregory et al. 2020; Wills et al. 2022), and it has potential implications beyond the models’ performance over the historical period.
Unlike for the two other indices, there is no consensus on whether the observed evolution of SST# contains a forced signal or is part of the unforced variability of the climate system. Some recent studies suggest that the tropical Pacific SST patterns observed during recent decades could arise from internal climate variability (e.g., Olonscheck et al. 2020; Watanabe et al. 2021). Other studies suggest that the SST patterns are consistent with a forced response to greenhouse forcing (Seager et al. 2019) that can be explained with simple models (Clement et al. 1996), or with a potential role for volcanic or anthropogenic aerosols in setting the recent patterns (Gregory et al. 2020; Heede and Fedorov 2021; Dittus et al. 2021). If the observed evolution of SST# is not forced, no model ensemble mean can be expected to agree with the observations; in that case, if a model fails the test, it means that its simulation of SST# variability has the wrong magnitude. On the other hand, if SST# is forced, a rejection means that the model does not replicate the forced response, and if a large number of models fail the test it could imply a common bias in the forced response. In either case, a rejection of the test indicates that some aspect of the model’s performance is deficient. Additional process-level analysis and physical hypothesis testing are required to improve our understanding of the causes behind the model errors.
d. Caveats and interpretation of the tests
The results above show how the methodology presented here can be used to assess historical simulations during the model development process. We have applied it to surface temperature indices, but it can be applied to any variable for which observational estimates over the historical period exist. However, the methodology presents some interpretation challenges and caveats. How do we interpret a rejection of the null hypothesis that the model’s forced response is realistic? Can we definitively conclude that there is a problem with the model’s forced signal? There is a chance that the null hypothesis is wrongly rejected although true; that is a type I error, whose probability is the chosen significance level. If we reject the null hypothesis, we must have an alternative hypothesis. Potential alternatives are as follows: there is a problem with the model’s forced signal; our model-based unforced variability is biased; the forcing is wrong. We do not have a statistical means to estimate the probability of these systematic errors.
It is also worth mentioning that agreement between the observations and simulations might be due to compensating errors. Potential problems that could contribute to compensating errors concern the following: aerosol radiative forcing and aerosol–cloud interactions (e.g., Paulot et al. 2018; Rieger et al. 2020; Wang et al. 2021; Fasullo et al. 2022); tropical SST patterns and their role on global radiative feedbacks (Ceppi and Gregory 2017; Andrews and Webb 2018). The unforced distributions used to define the exceedance quantile functions are constructed from piControl simulations. This assumes that the multimodel ensemble provides us with a good representation of the unforced variability, which is not necessarily true. As we have shown above when discussing the results of the variance-scaled results, there exist large discrepancies in the representation of unforced variability between models (Parsons et al. 2020), which raises questions about the ability of at least some models to provide a good estimate of unforced variability. If the unforced variability estimated from the multimodel ensemble is biased, then our method will be biased. One avenue that could be explored for improving this would be to incorporate information from proxy temperature reconstructions into a correction of the unforced variability. However, the use of proxy reconstructions is not free from problems. The reconstructions are for restricted regions where there are proxies (e.g., PAGES 2k Consortium 2013), and much of their variability is forced by volcanoes and solar variability (PAGES 2k Consortium 2019). In any case, a failure of this type would imply that the models piControl simulations are wrong (rather than the forced signal necessarily), so the test would still be highlighting a problem.
Our test with scaled variance is an initial attempt to identify outliers, but more sophisticated methods could be used. Perhaps a better estimate of the unforced variability could be achieved by restricting the set of models used to form the distributions of internal variability. This selection could be based on how models represent observational estimates of the spectra of some modes of variability (Fasullo et al. 2020). For SST#, basing this selection on some metric of ENSO could be particularly useful (Planton et al. 2021). Screening out models would reduce the number of piControl simulations, which would affect the robustness of the unforced distributions.
A second caveat is the differing sizes of the historical ensembles. Out of the 40 models analyzed here, only 4 have historical ensembles with more than 10 members, and 31 models have 5 or fewer historical simulations. Large ensembles will provide more robust tests. A model with a small ensemble will provide a less precise estimate of the ensemble mean, making the result of the test more likely to differ from the result that would be obtained with a large ensemble. This is a general problem with statistical hypothesis testing, and it should be incorporated into the subjective interpretation of the tests. We propose some guidance based on the dependence of the variance of the control distribution on the size of the ensemble. As explained above, the control distribution is constructed from samples of $\overline{U}_M(t) - U_O(t) - E_O(t)$, and the contribution of the ensemble-mean term to its variance decreases as the ensemble size increases (see the sketch below), so the test is correspondingly less stringent for models with few historical members.
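The dependence on ensemble size can be made explicit with a simple variance argument. Treating the piControl segments and the observational error as independent, with common unforced variance $\sigma_U^2$ and error variance $\sigma_E^2$, is an idealization of the sampling described in section 3:

```latex
\operatorname{Var}\!\left[\overline{U}_M(t) - U_O(t) - E_O(t)\right]
  \approx \frac{\sigma_U^2}{N_m} + \sigma_U^2 + \sigma_E^2 .
```

As $N_m$ grows, the first term shrinks, the control distribution narrows, and fewer exceedances are tolerated before rejection; for small ensembles the control distribution is wider, so a “pass” from a one- or two-member ensemble carries correspondingly less weight.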
We do not account for the uncertainty in radiative forcing, which could lead to overtuning if the only objective is to match the warming over the historical period (e.g., Hourdin et al. 2017). However, we are not advocating making development choices based only on the approach presented here. A wide range of other metrics, including process-based metrics, needs to be considered. The use of a much wider basket of metrics should reduce the risk of overtuning.
A final caveat is that the variance scaling cannot account for differences in models’ piControl variability on different time scales, so while the overall variability of two models can be scaled to be similar, their interannual and multidecadal variability could still be very different. We have subjectively accounted for this in the discussion of the SST# results for GISS-E2-1-G, whose variability is dominated by large interannual variability, which can be confidently assessed with observations of the historical period. However, for variability at much longer time scales, the observational record provides very limited information. A possible future development would be to apply a different variance scaling factor for each value of the averaging window.
5. Conclusions
The historical record of surface temperature is an important metric that climate models should be able to reproduce. However, it is not consistently used by modeling centers during model development for two main and quite distinct reasons: first, coupled simulations are expensive to run, especially because the historical simulation must be preceded by a spinup simulation long enough to eliminate drift; second, the observed historical record of surface temperature is reserved as an out-of-sample validation. It is generally argued that the warming during the historical record and emergent properties like equilibrium climate sensitivity should be used as an a posteriori evaluation and not as a target for model development, although there is not complete consensus among the modeling community on this topic (Hourdin et al. 2017). Bock et al. (2020) highlight the risk of tuning models to reproduce a set of metrics while ignoring deficiencies elsewhere. However, this risk is not specific to metrics based on historical warming. Within the context of emergent constraints, Eyring et al. (2019) advocate the use of variability metrics or trends during model development.
We develop a statistical method to test whether simulations of large-scale surface temperature change are consistent with the observed warming of the historical period (1850–2014). The method uses information on a range of time scales. It incorporates information about unforced variability, and it is designed to test an entire ensemble of simulations of any size. The method is applied to annual-mean time series of three surface temperature indices: global mean, hemispheric gradient, and a recently developed index that captures the sea surface temperature (SST) pattern in the tropics (SST#; Fueglistaler and Silvers 2021). We test the historical simulations of the CMIP6 ensemble and post-CMIP6 versions of the HadGEM3 and UKESM models.
Around half the models fail the test for the global-mean time series, approximately a fifth of the models fail when the hemispheric temperature gradient is analyzed, and all models fail the SST# test. We note the importance of the characteristics of the models’ unforced variability (Parsons et al. 2020). Assessment of the quality of the historical simulations by visual comparison of the time series of a few ensemble members against the observations can be misleading, being reliable only for models with a large number of historical realizations. The method presented here complements other statistical approaches that have previously compared historical model simulations to observations (e.g., Sanderson et al. 2015; Brunner et al. 2020; Suarez-Gutierrez et al. 2021). Given that most modeling centers only run a small number of historical simulations, a method like the one presented here that accounts for the unforced variability is desirable, especially if the aim is to use it during the model development process, where large ensembles are not affordable.
We show that the method presented here can be used as a tool to assess historical simulations during the development process. The method is easy to apply and summarizes a large amount of information in two plots, with and without variance scaling. It accounts for the unforced variability of the model tested, and it can be applied to an ensemble of historical simulations of arbitrary size. We also plan to make this methodology available to the community by implementing it in the Earth System Model Evaluation Tool (ESMValTool; Eyring et al. 2020).
There are several avenues that could be explored to develop this method further. One potential improvement could be to incorporate information from proxy reconstructions to improve the estimate of the unforced variability, currently based on control model simulations. However, this may prove difficult given that many proxies do not resolve annual variability, and because of the nonstationarity of the magnitude of internal variability. Perhaps a better estimate of the unforced variability could be achieved by restricting the model set used to form the distributions of internal variability based on how models represent observational estimates of annual to decadal modes of variability (Fasullo et al. 2020).
A second area for further development could be to apply a scaling factor, as is done in optimal fingerprinting (e.g., Allen and Tett 1999). Some of the models that are rejected by our current methodology could pass the test if they were appropriately scaled. The interpretation of the test results with the scaled time series is not straightforward, but it may be useful to know that a model that is rejected could be made realistic by a scaling factor. Despite these caveats, the generality of the method presented here and the fact that it incorporates information about the unforced variability make it a useful tool for the assessment of historical simulations.
Acknowledgments.
This work was supported by the Met Office Hadley Centre Climate Programme funded by BEIS. We thank an anonymous reviewer, J. T. Fasullo and R. C. J. Wills for their constructive comments that helped improve the original manuscript. We thank Gareth Jones for his contribution to the methodology and useful comments on an early draft version of the manuscript. We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.
Data availability statement.
HadCRUT.5.0.1.0 data were obtained from http://www.metoffice.gov.uk/hadobs/hadcrut5 on 15 February 2021 and are British Crown Copyright, Met Office 2021, provided under an Open Government License, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/. PCMDI AMIP SSTs were obtained from the ESGF archive, variable tosbcs from input4MIPs, version v20220201. NOAA_ERSST_V5 data provided by NOAA/OAR/ESRL PSL, Boulder, Colorado, were accessed at https://psl.noaa.gov/data/gridded/data.noaa.ersst.v5.html (Huang et al. 2017a,b).
REFERENCES
Allen, M. R., and S. F. B. Tett, 1999: Checking for model consistency in optimal fingerprinting. Climate Dyn., 15, 419–434, https://doi.org/10.1007/s003820050291.
Andrews, T., and M. J. Webb, 2018: The dependence of global cloud and lapse rate feedbacks on the spatial structure of tropical Pacific warming. J. Climate, 31, 641–654, https://doi.org/10.1175/JCLI-D-17-0087.1.
Annamalai, H., J. Hafner, K. P. Sooraj, and P. Pillai, 2013: Global warming shifts the monsoon circulation, drying South Asia. J. Climate, 26, 2701–2718, https://doi.org/10.1175/JCLI-D-12-00208.1.
Bock, L., and Coauthors, 2020: Quantifying progress across different CMIP phases with the ESMValTool. J. Geophys. Res. Atmos., 125, e2019JD032321, https://doi.org/10.1029/2019JD032321.
Boucher, O., and Coauthors, 2020: Presentation and evaluation of the IPSL-CM6A-LR climate model. J. Adv. Model. Earth Syst., 12, e2019MS002010, https://doi.org/10.1029/2019MS002010.
Braganza, K., D. Karoly, A. Hirst, M. Mann, P. Stott, R. Stouffer, and S. Tett, 2003: Simple indices of global climate variability and change: Part I—Variability and correlation structure. Climate Dyn., 20, 491–502, https://doi.org/10.1007/s00382-002-0286-0.
Brunner, L., A. G. Pendergrass, F. Lehner, A. L. Merrifield, R. Lorenz, and R. Knutti, 2020: Reduced global warming from CMIP6 projections when weighting models by performance and independence. Earth Syst. Dyn., 11, 995–1012, https://doi.org/10.5194/esd-11-995-2020.
Ceppi, P., and J. M. Gregory, 2017: Relationship of tropospheric stability to climate sensitivity and Earth’s observed radiation budget. Proc. Natl. Acad. Sci. USA, 114, 13 126–13 131, https://doi.org/10.1073/pnas.1714308114.
Clement, A. C., R. Seager, M. A. Cane, and S. E. Zebiak, 1996: An ocean dynamical thermostat. J. Climate, 9, 2190–2196, https://doi.org/10.1175/1520-0442(1996)009<2190:AODT>2.0.CO;2.
Dittus, A. J., E. Hawkins, L. J. Wilcox, R. T. Sutton, C. J. Smith, M. B. Andrews, and P. M. Forster, 2020: Sensitivity of historical climate simulations to uncertain aerosol forcing. Geophys. Res. Lett., 47, e2019GL085806, https://doi.org/10.1029/2019GL085806.
Dittus, A. J., E. Hawkins, J. I. Robson, D. M. Smith, and L. J. Wilcox, 2021: Drivers of recent North Pacific decadal variability: The role of aerosol forcing. Earth’s Future, 9, e2021EF002249, https://doi.org/10.1029/2021EF002249.
Eyring, V., S. Bony, G. A. Meehl, C. A. Senior, B. Stevens, R. J. Stouffer, and K. E. Taylor, 2016: Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev., 9, 1937–1958, https://doi.org/10.5194/gmd-9-1937-2016.
Eyring, V., and Coauthors, 2019: Taking climate model evaluation to the next level. Nat. Climate Change, 9, 102–110, https://doi.org/10.1038/s41558-018-0355-y.
Eyring, V., and Coauthors, 2020: Earth System Model Evaluation Tool (ESMValTool) v2.0—An extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of Earth system models in CMIP. Geosci. Model Dev., 13, 3383–3438, https://doi.org/10.5194/gmd-13-3383-2020.
Fasullo, J. T., A. S. Phillips, and C. Deser, 2020: Evaluation of leading modes of climate variability in the CMIP archives. J. Climate, 33, 5527–5545, https://doi.org/10.1175/JCLI-D-19-1024.1.
Fasullo, J. T., J.-F. Lamarque, C. Hannay, N. Rosenbloom, S. Tilmes, P. DeRepentigny, A. Jahn, and C. Deser, 2022: Spurious late historical-era warming in CESM2 driven by prescribed biomass burning emissions. Geophys. Res. Lett., 49, e2021GL097420, https://doi.org/10.1029/2021GL097420.
Flynn, C. M., and T. Mauritsen, 2020: On the climate sensitivity and historical warming evolution in recent coupled model ensembles. Atmos. Chem. Phys., 20, 7829–7842, https://doi.org/10.5194/acp-20-7829-2020.
Fueglistaler, S., and L. G. Silvers, 2021: The peculiar trajectory of global warming. J. Geophys. Res. Atmos., 126, e2020JD033629, https://doi.org/10.1029/2020JD033629.
Gillett, N. P., F. W. Zwiers, A. J. Weaver, G. C. Hegerl, M. R. Allen, and P. A. Stott, 2002: Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett., 29, 1970, https://doi.org/10.1029/2002GL015836.
Golaz, J.-C., and Coauthors, 2019: The DOE E3SM coupled model version 1: Overview and evaluation at standard resolution. J. Adv. Model. Earth Syst., 11, 2089–2129, https://doi.org/10.1029/2018MS001603.
Gregory, J. M., and T. Andrews, 2016: Variation in climate sensitivity and feedback parameters during the historical period. Geophys. Res. Lett., 43, 3911–3920, https://doi.org/10.1002/2016GL068406.
Gregory, J. M., T. Andrews, P. Ceppi, T. Mauritsen, and M. J. Webb, 2020: How accurately can the climate sensitivity to CO2 be estimated from historical climate change? Climate Dyn., 54, 129–157, https://doi.org/10.1007/s00382-019-04991-y.
Gulev, S. K., and Coauthors, 2021: Changing state of the climate system. Climate Change 2021: The Physical Science Basis, V. Masson-Delmotte et al., Eds., Cambridge University Press, 287–422, https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_Chapter02.pdf.
Heede, U. K., and A. V. Fedorov, 2021: Eastern equatorial Pacific warming delayed by aerosols and thermostat response to CO2 increase. Nat. Climate Change, 11, 696–703, https://doi.org/10.1038/s41558-021-01101-x.
Hourdin, F., and Coauthors, 2017: The art and science of climate model tuning. Bull. Amer. Meteor. Soc., 98, 589–602, https://doi.org/10.1175/BAMS-D-15-00135.1.
Huang, B., and Coauthors, 2017a: Extended Reconstructed Sea Surface Temperature, version 5 (ERSSTv5): Upgrades, validations, and intercomparisons. J. Climate, 30, 8179–8205, https://doi.org/10.1175/JCLI-D-16-0836.1.
Huang, B., and Coauthors, 2017b: NOAA Extended Reconstructed Sea Surface Temperature (ERSST), version 5. NOAA National Centers for Environmental Information, accessed 1 September 2021, https://doi.org/10.7289/V5T72FNM.
Hurrell, J. W., J. J. Hack, D. Shea, J. M. Caron, and J. Rosinski, 2008: A new sea surface temperature and sea ice boundary dataset for the Community Atmosphere Model. J. Climate, 21, 5145–5153, https://doi.org/10.1175/2008JCLI2292.1.
Jones, G. S., 2020: “Apples and oranges”: On comparing simulated historic near-surface temperature changes with observations. Quart. J. Roy. Meteor. Soc., 146, 3747–3771, https://doi.org/10.1002/qj.3871.
Jones, G. S., P. A. Stott, and N. Christidis, 2013: Attribution of observed historical near–surface temperature variations to anthropogenic and natural causes using CMIP5 simulations. J. Geophys. Res. Atmos., 118, 4001–4024, https://doi.org/10.1002/jgrd.50239.
Jones, P., 2016: The reliability of global and hemispheric surface temperature records. Adv. Atmos. Sci., 33, 269–282, https://doi.org/10.1007/s00376-015-5194-4.
Mauritsen, T., and Coauthors, 2019: Developments in the MPI‐M Earth system model version 1.2 (MPI‐ESM1.2) and its response to increasing CO2. J. Adv. Model. Earth Syst., 11, 998–1038, https://doi.org/10.1029/2018MS001400.
McKinnon, K. A., and C. Deser, 2018: Internal variability and regional climate trends in an observational large ensemble. J. Climate, 31, 6783–6802, https://doi.org/10.1175/JCLI-D-17-0901.1.
Miller, R. L., 1997: Tropical thermostats and low cloud cover. J. Climate, 10, 409–440, https://doi.org/10.1175/1520-0442(1997)010<0409:TTALCC>2.0.CO;2.
Morice, C. P., and Coauthors, 2021: An updated assessment of near-surface temperature change from 1850: the HadCRUT5 data set. J. Geophys. Res. Atmos., 126, e2019JD032361, https://doi.org/10.1029/2019JD032361.
Mulcahy, J. P., and Coauthors, 2018: Improved aerosol processes and effective radiative forcing in HadGEM3 and UKESM1. J. Adv. Model. Earth Syst., 10, 2786–2805, https://doi.org/10.1029/2018MS001464.
Mulcahy, J. P., and Coauthors, 2023: UKESM1.1: Development and evaluation of an updated configuration of the UK Earth System Model. Geosci. Model Dev., https://doi.org/10.5194/gmd-2022-113, in press.
Olonscheck, D., and D. Notz, 2017: Consistently estimating internal climate variability from climate model simulations. J. Climate, 30, 9555–9573, https://doi.org/10.1175/JCLI-D-16-0428.1.
Olonscheck, D., M. Rugenstein, and J. Marotzke, 2020: Broad consistency between observed and simulated trends in sea surface temperature patterns. Geophys. Res. Lett., 47, e2019GL086773, https://doi.org/10.1029/2019GL086773.
Orbe, C., and Coauthors, 2020: Representation of modes of variability in six U.S. climate models. J. Climate, 33, 7591–7617, https://doi.org/10.1175/JCLI-D-19-0956.1.
PAGES 2k Consortium, 2013: Continental-scale temperature variability during the past two millennia. Nat. Geosci., 6, 339–346, https://doi.org/10.1038/ngeo1797.
PAGES 2k Consortium, 2019: Consistent multidecadal variability in global temperature reconstructions and simulations over the Common Era. Nat. Geosci., 12, 643–649, https://doi.org/10.1038/s41561-019-0400-0.
Parsons, L. A., M. K. Brennan, R. C. J. Wills, and C. Proistosescu, 2020: Magnitudes and spatial patterns of interdecadal temperature variability in CMIP6. Geophys. Res. Lett., 47, e2019GL086588, https://doi.org/10.1029/2019GL086588.
Paulot, F., D. Paynter, P. Ginoux, V. Naik, and L. W. Horowitz, 2018: Changes in the aerosol direct radiative forcing from 2001 to 2015: Observational constraints and regional mechanisms. Atmos. Chem. Phys., 18, 13 265–13 281, https://doi.org/10.5194/acp-18-13265-2018.
Planton, Y. Y., and Coauthors, 2021: Evaluating climate models with the CLIVAR 2020 ENSO metrics package. Bull. Amer. Meteor. Soc., 102, E193–E217, https://doi.org/10.1175/BAMS-D-19-0337.1.
Reichler, T., and J. Kim, 2008: How well do coupled models simulate today’s climate? Bull. Amer. Meteor. Soc., 89, 303–312, https://doi.org/10.1175/BAMS-89-3-303.
Richardson, M., K. Cowtan, E. Hawkins and M. B. Stolpe, 2016: Reconciled climate response estimates from climate models and the energy budget of Earth. Nat. Climate Change, 6, 931–935, https://doi.org/10.1038/nclimate3066.
Rieger, L. A., J. N. S. Cole, J. C. Fyfe, S. Po-Chedley, P. J. Cameron-Smith, P. J. Durack, N. P. Gillett, and Q. Tang, 2020: Quantifying CanESM5 and EAMv1 sensitivities to Mt. Pinatubo volcanic forcing for the CMIP6 historical experiment. Geosci. Model Dev., 13, 4831–4843, https://doi.org/10.5194/gmd-13-4831-2020.
Sanderson, B. M., R. Knutti, and P. Caldwell, 2015: A representative democracy to reduce interdependency in a multimodel ensemble. J. Climate, 28, 5171–5194, https://doi.org/10.1175/JCLI-D-14-00362.1.
Seager, R., M. Cane, N. Henderson, D.-E. Lee, R. Abernathey, and H. Zhang, 2019: Strengthening tropical Pacific zonal sea surface temperature gradient consistent with rising greenhouse gases. Nat. Climate Change, 9, 517–522, https://doi.org/10.1038/s41558-019-0505-x.
Sen Gupta, A., N. C. Jourdain, J. N. Brown, and D. Monselesan, 2013: Climate drift in the CMIP5 models. J. Climate, 26, 8597–8615, https://doi.org/10.1175/JCLI-D-12-00521.1.
Smith, C. J., and Coauthors, 2021: Energy budget constraints on the time history of aerosol forcing and climate sensitivity. J. Geophys. Res. Atmos., 126, e2020JD033622, https://doi.org/10.1029/2020JD033622.
Suarez-Gutierrez, L., S. Milinski, and N. Maher, 2021: Exploiting large ensembles for a better yet simpler climate model evaluation. Climate Dyn., 57, 2557–2580, https://doi.org/10.1007/s00382-021-05821-w.
Taylor, K. E., D. Williamson, and F. Zwiers, 2000: The sea surface temperature and sea ice concentration boundary conditions for AMIP II simulations. PCMDI Rep. 60, 28 pp., https://pcmdi.llnl.gov/report/pdf/60.pdf.
Wang, C., B. J. Soden, W. Yang, and G. A. Vecchi, 2021: Compensation between cloud feedback and aerosol-cloud interaction in CMIP6 models. Geophys. Res. Lett., 48, e2020GL091024, https://doi.org/10.1029/2020GL091024.
Watanabe, M., J.-L. Dufresne, Y. Kosaka, T. Mauritsen, and H. Tatebe, 2021: Enhanced warming constrained by past trends in equatorial Pacific sea surface temperature gradient. Nat. Climate Change, 11, 33–37, https://doi.org/10.1038/s41558-020-00933-3.
Wills, R. C. J., Y. Dong, C. Proistosecu, K. C. Armour, and D. S. Battisti, 2022: Systematic climate model biases in the large-scale patterns of recent sea-surface temperature and sea-level pressure change. Geophys. Res. Lett., 49, e2022GL100011, https://doi.org/10.1029/2022GL100011.
Zhang, J., and Coauthors, 2021: The role of anthropogenic aerosols in the anomalous cooling from 1960 to 1990 in the CMIP6 Earth system models. Atmos. Chem. Phys., 21, 18 609–18 627, https://doi.org/10.5194/acp-21-18609-2021.
Zinke, J., S. A. Browning, A. Hoell, and I. D. Goodwin, 2021: The west Pacific gradient tracks ENSO and zonal Pacific sea surface temperature gradient during the last millennium. Sci. Rep., 11, 20395, https://doi.org/10.1038/s41598-021-99738-3.