## Abstract

In a recent article, Cheung et al. applied a semiempirical methodology to isolate internal climate variability (ICV) in CMIP5 models and observations. The essence of their methodology is to subtract the scaled CMIP5 multimodel ensemble mean (MMEM) from individual model simulations and from the observed time series of several surface temperature indices. Cheung et al. detected large differences in both the magnitude and spatial patterns of the observed and simulated ICV, as well as large differences between the historical (simulated) ICV and preindustrial (PI) control CMIP5 simulations. Here it is shown that subtraction of the scaled MMEM from CMIP5 historical simulations produces a poor estimate of the modeled ICV due to the difference between the scaled MMEM and a given model’s true forced signal masquerading as ICV. The resulting phase and amplitude errors of the ICV so estimated are large, which compromises most of Cheung et al.’s conclusions pertaining to characterization of ICV in the historical CMIP5 simulations. By contrast, an alternative methodology based on forced signals computed from individual model ensembles produces a much more accurate estimate of the ICV in CMIP5 models, whose magnitude is consistent with the PI control simulations and is much smaller than any of the semiempirical estimates of the observed ICV on decadal and longer time scales.

## 1. Introduction

Accurate identification of internal (unforced) climate variability (ICV) is essential for understanding and predicting long-term climate change as apparent, for example, in a variety of the Northern Hemisphere surface temperature indices. One of the ways to isolate ICV in climate model historical simulations is to utilize ensembles of simulations available through phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012). Here, averaging historical simulations over multiple realizations of a climate-index time series of a given model provides an estimate of the forced signal, while differencing each simulation with the forced signal so computed provides an estimate of the simulated ICV. Steinman et al. (2015a,b) correctly noted that forced signals inferred via such single-model ensemble means (SMEM) are contaminated by insufficient averaging of ICV—as a result of the small number of historical realizations typically available in individual model ensembles—and argued that additional averaging across the entire multimodel ensemble [the multimodel ensemble mean (MMEM)] is required for proper estimation of the true forced signal. They claimed that the MMEM rescaled—via linear regression, to match either the observed or simulated climate change—provides unbiased estimates of the forced signal in observations and CMIP5 historical simulations, and can thus be used to obtain unbiased estimates of (both observed and simulated) residual ICV. Along the same lines, Frankcombe et al. (2015) introduced two-factor and three-factor scaling methods for identification of the forced signal and ICV; these methods utilized not only the MMEM based on historical simulations incorporating all forcing factors but also the MMEMs from the historical simulations forced by the greenhouse gas or natural forcing only. In this comment, we will refer to the single-factor, two-factor, and three-factor estimates of the forced signal and the associated estimates of ICV as the MMEM1, MMEM2, and MMEM3 methods, respectively.

Recently, Kravtsov et al. (2015) and Kravtsov and Callicutt (2017) showed that Steinman et al.’s (2015a,b) arguments are flawed, and it is impossible to characterize a rich spectrum of forced signals realized in the individual CMIP5 models using the single rescaled MMEM signal. In particular, these authors demonstrated that subtraction of the rescaled MMEM from the ensembles of individual model simulations does not produce independent (uncorrelated) realizations of this model’s ICV, which are instead contaminated (and often dominated) by the common difference between this model’s true forced signal and the rescaled MMEM. They also showed that the smoothed (5-yr low-pass filtered) SMEM of an individual model provides a much more accurate estimate of this model’s true forced signal. Kravtsov and Callicutt (2017) further used the appropriately rescaled SMEMs of individual models in conjunction with the observed time series of a given climate index to estimate forced signal (and ICV) in observations. Meanwhile, Kravtsov (2017) developed a spectral variance-inflation method to correct for the errors associated with insufficient SMEM averaging and to produce unbiased estimates of ICV in CMIP5 historical simulations. We will hereafter jointly refer to the above methods for estimating the observed and simulated ICV over the historical period as the KC2017 method.

The above criticisms of the MMEM1 method also fully apply to a recent work by Cheung et al. (2017, hereafter C2017), who used it as the main tool to compare the observed and CMIP5-simulated ICV of surface temperature over the course of the twentieth century. These authors found considerable mismatches in the magnitude of the observed and modeled ICV on decadal and longer time scales, as well as mismatches in the ICV between CMIP5 historical runs and preindustrial (PI) control runs, with the simulated ICV over the historical period being generally much lower than the observed ICV, but higher than the variability in the PI control runs. They also detected pronounced differences between the patterns of the observed and simulated ICV so computed. The main point of this comment is to show that these results are quantitatively inaccurate due to biases of the MMEM1 method, whereas the KC2017 method provides means to a much more accurate estimation of both the amplitude and the phase of the true ICV in CMIP5 historical simulations.

Further presentation is organized as follows. The data and analysis methodology are described in section 2. Section 3 analyzes the performance of the various MMEM methods, as well as the KC2017 method in isolating ICV in the simulations from the Community Earth System Model (CESM) Large Ensemble Project (LENS: Kay et al. 2015; see http://www.cesm.ucar.edu/projects/community-projects/LENS/) in which a large ensemble size permits a fairly accurate determination of the true forced (and internal) variability. The application of these methods to CMIP5 multimodel ensemble is discussed in sections 4 and 5, and conclusions are presented in section 6. The supplemental material contains further information in support of this paper’s main results, as well as the link to the data and code for all of the analyses performed here.

## 2. Data and methods

We utilized the same historical (see Table 1), historicalNat (natural forcing only), historicalGHG (greenhouse gas forcing only), and PI control CMIP5 simulations (1880–2005) as in Kravtsov (2017). The subensemble of CMIP5 historical simulations (17 models and 111 simulations) was smaller than the ensemble used in C2017 since we only considered the models with four or more historical simulations available to be able to apply the KC2017 method; note, however, that the MMEM based on this subensemble is virtually indistinguishable from the MMEM based on the entire CMIP5 ensemble (not shown). We also utilized the 40-member historical ensemble (1920–2005) and long PI control simulations from the LENS project (Kay et al. 2015). The observed surface temperature data we used were also identical to the data used in Kravtsov (2017) (and C2017); one notable difference with C2017 here is our use of the Twentieth Century Reanalysis (20CR; Compo et al. 2011) to compute the Northern Hemisphere mean temperature (instead of GISTEMP data in C2017). Our results do not depend on the choice of surface temperature dataset.

Following Steinman et al. (2015a) and C2017, we considered the Atlantic multidecadal oscillation (AMO), Pacific multidecadal oscillation (PMO), and Northern Hemisphere multidecadal oscillation (NMO) indices; the AMO and PMO indices were computed as sea surface temperature averaged over the North Atlantic and North Pacific (0°–60°N), respectively, and the NMO was computed as a weighted sum of the 0°–60°N surface air temperature averaged over ocean and land, with the weights set to 0.61 and 0.39, respectively.

We computed forced signals and ICV estimates (in both observations and model simulations) using the MMEM1, MMEM2, MMEM3, and KC2017 methods. In applying the KC2017 method to LENS simulations, we considered random subensembles of five simulations (which is a typical size of an individual model ensemble in CMIP5 models) containing a given simulation to compute 5-yr low-pass-filtered ensemble mean as a first-guess estimate of the forced signal and the residual ICV in that simulation, and then applied this procedure to all other simulations. The resulting 40 estimates of ICV were further variance–bias corrected using the procedure developed in Kravtsov (2017). For the LENS ensemble, we also computed the “true forced signal” as the grand (40 member) ensemble mean, and the associated “true residual ICV” for each of the 40 historical realizations. As in C2017, we further computed low-pass-filtered versions of the AMO, PMO, and NMO ICV using the data-adaptive filter of Mann (2008) with cutoff frequencies of 0.1, 0.05, and 0.025 cycles per year (or corresponding smoothing time scales of 10, 20 and 40 yr).

## 3. Quantifying performance of different methods for isolating internal variability in historical simulations using CESM LENS ensemble

Figure 1 shows the magnitude of internal variability in LENS simulations estimated by different methods (top row), as well as the root-mean-square (rms) error of each method with respect to the “true internal variability” based on the subtraction of the grand ensemble mean from individual simulations (bottom row), as a function of smoothing time scale. It turns out that the rescaled CMIP5 MMEM is fairly close to the true forced signal in the CESM model, and the MMEM-based methods generally provide decent estimates of ICV in CESM historical runs (see Fig. S1 in the supplemental material). Still, the KC2017 method is more accurate than MMEM1 method in identifying low-frequency ICV, with the rms error of the 40-yr low-pass-filtered ICV estimated using the KC2017 method being 75% of the MMEM1 error for the PMO (Fig. 1b), and about 40% of the MMEM1 error for the AMO and NMO (Figs. 1d,f). The MMEM2 and MMEM3 methods perform worse than both the MMEM1 and KC2017 methods across the entire range of time scales and for all indices. The MMEM1 and MMEM2 methods tend to overestimate the magnitude of the true ICV at low frequencies, whereas the MMEM3 method slightly underestimates this magnitude (Figs. 1a,c,e). Note that the amplitudes of ICV estimated by all methods are overall consistent; hence, large rms errors of the MMEM2 and MMEM3 methods (Figs. 1b,d,f) are due to a combination of both amplitude and, most importantly, phase errors with respect to the true ICV (in other words, the magnitude of the estimated ICV may match the magnitude of the true ICV, but it may still be poorly correlated with the true ICV; see examples in Fig. S1). In summary, the KC2017 method provides the best estimate of the ICV in CESM LENS simulations, whereas the performance of MMEM-based methods strongly depends on how close a given type of MMEM is to the true forced signal of this model.

## 4. Internal variability in observations and CMIP5 simulations: Comparison of different estimation methods

The main problem with using MMEM-based methods for estimation of ICV in CMIP5 models, as in C2017, is illustrated in Fig. 2, which shows estimates of ICV in the five available historical simulations of GFDL CM3. Based on the results of section 3, we assume that the KC2017 method provides the best estimate of the ICV; Kravtsov et al. (2015) showed that this method indeed leads to the independent (uncorrelated) ICV estimates in different realizations of a given model. By contrast, all of the GFDL CM3’s realizations of ICV estimated by using MMEM1 and MMEM2 methods clearly have the same shape (and are thus strongly correlated) in the second half of the twentieth century, whereas the true forced signal of the GFDL CM3 is very different from its MMEM-based estimate (not shown). The internal variability estimated using these methods is thus contaminated (and in this case dominated) by the mismatch between the true forced signal of GFDL CM3 and the MMEM. The MMEM3 method does much better, and its estimated ICV is closer to the KC2017 estimates (because the MMEM3-based forced signal is closer to that estimated using the KC2017 method), but there are still considerable differences in the phase and amplitude of ICV estimated by the MMEM3 and KC2017 methods.

Figure 3 summarizes the performance of the four methods discussed in this paper across the entirety of the CMIP5 subensemble considered. We concentrate here on the results for the 40-yr low-pass-filtered ICV, as the results utilizing 10- and 20-yr smoothing time scales are qualitatively similar. The results for the PMO, AMO and NMO indices are also consistent. First off, irrespective of the method used to infer either observed or simulated ICV, the magnitude of multidecadal ICV in CMIP5 simulations is much lower than the observed magnitude, consistent with Frankcombe et al. (2015), Kravtsov and Callicutt (2017), Kravtsov (2017), and C2017. Among the observed estimates, the KC2017 method leads to the largest estimated magnitude of the ICV, and MMEM3 method to the smallest magnitude, while the magnitude estimated by the MMEM1 and MMEM2 methods is in between. For the simulated variability, the magnitude of the ICV inferred via MMEM1 and MMEM2 methods is the largest, and exceeds the magnitude of the variability in PI control runs by a factor of approximately 1.7–1.8 for the PMO and NMO indices and by a factor of 1.25 for the AMO index, consistent with C2017. Both the KC2017 and MMEM3 methods give magnitudes of simulated ICV consistent with the PI control runs, with the KC2017-based magnitude being slightly larger and the MMEM3-based magnitude slightly smaller than the magnitude in the PI control simulations.

The amplitude bias in the MMEM1- and MMEM2-based estimates of ICV in historical CMIP5 simulations stems from misidentification of the ICV illustrated in Fig. 2, as the individual models’ forced signals inferred using these methods are systematically different from the true forced signals in these models, which are best approximated by the KC2017 method. Of the three MMEM-based methods considered, the MMEM3 method provides the estimates of the forced signal and ICV that best match the KC2017 estimates (Figs. 3b,d,f), but even in this case, the mismatch between these two methods on multidecadal time scales (for the 40-yr low-pass-filtered data) is characterized by a large rms error on the order of the standard deviation of the simulated ICV (cf. brown bars in Figs. 3b,d,f with the green or brown bars in Figs. 3a,c,e, respectively).

In general, the MMEM-based methods are expected to provide faithful estimates of the true ICV in a given model if this model’s true forced signal—well approximated by the KC2017 method—is close to the corresponding scaled MMEMs, and poor estimates otherwise. We thus developed a similarity index to characterize the individual models in terms of how well their MMEM-based estimates of the forced signal and ICV match the corresponding KC2017 estimates. To do so, we utilized 40-yr low-pass-filtered MMEM1, MMEM2, MMEM3, and KC2017 estimates of ICV and computed the average rms error of the three MMEM-based methods with respect to the KC2017 method for individual model subensembles. For a given climate index (AMO, PMO, or NMO) and a given MMEM method (MMEM1, MMEM2, or MMEM3), this procedure resulted in 17 values of the rms error (one value per model). We then assigned the consistency index value of +1 to the models characterized by the lowest rms (below the 33rd percentile of the 17 rms values), the consistency index value of −1 to the models characterized by the highest rms (above the 66th percentile of the 17 rms values), and the index value of 0 to all other models. Finally, we averaged the resulting consistency index values across the estimates corresponding to the MMEM1, MMEM2, and MMEM3 methods, and then across the consistency index estimates for the AMO, PMO, and NMO time series. The resulting consistency index has a range between −1 and +1. The value of −1 corresponds to the situation when a given model is characterized by a high rms error with respect to the KC2017 method for all MMEM-based methods and for all three climate indices (hence, for such a model, the MMEM-based estimates of the forced signal and internal variability are poor). Conversely, the consistency index value of +1 would indicate that the corresponding model’s true forced signal is close to its MMEM-based estimate. We further clustered the models in three groups by selecting the models with consistency indices below −0.34, above +0.34, and in between, and assigning to these models the values of −1 (true forced signal inconsistent with MMEM), +1 (true forced signal consistent with MMEM), and 0 (intermediate group in terms of consistency between MMEM-based and KC2017-based forced signals).

The consistency index results (Table 1) indicate that MMEM-based forced signal estimation methods produce good estimates of the ICV in GISS models (except for GISS-E2-Rp3), as well as in MRI-CGCM3, poor estimates of ICV in the CanESM2, CSIRO Mk3.6.0, GFDL CM3, GISS-E2-Rp3, HadGEM2-ES, and IPSL-CM5A-LR models, and intermediate-quality estimates in CCSM4, CNRM-CM5, GFDL CM2.1, HadCM3, and MIROC5 models. Figure 2 gives an example of the ICV in a model from the “poor” group; see Figs. S2 and S3 in the supplemental material for the analogous examples from the “good” and “intermediate” groups. Thus, CMIP5 models are characterized by a wide variety of forced signals, which MMEM-based estimation methods fail to capture (cf. Kravtsov et al. 2015; Kravtsov and Callicutt 2017; Kravtsov 2017). Therefore, application of these methods (and especially MMEM1 method) to characterize ICV in historical CMIP5 simulations, as was done in C2017, is not appropriate.

## 5. Patterns of the observed and simulated ICV

Given the large differences between the 40-yr low-pass-filtered ICV time series computed using C2017’s MMEM1 method and the KC2017 method (see, e.g., Fig. 2 for the GFDL CM3), it is reasonable to expect that the associated spatial patterns will also be different. Indeed, the latter differences between the GFDL CM3’s AMO patterns derived via the two methods are substantial (see Fig. 4); the analogous results for the PMO and NMO patterns are shown in Figs. S4 and S5 of the supplemental material. We computed the spatial patterns in Fig. 4 in two ways, by correlating the AMO time series with either raw SST data in each simulation (as apparently was done in C2017) (see Figs. 4a,c herein), or, alternatively, with SST anomalies formed by subtracting the SMEM-based or MMEM1-based forced signals at each model grid point (Figs. 4b,d). Note that the resulting correlation maps are not identical for either the SMEM (Figs. 4a,b) or the MMEM1 method (Figs. 4c,d). These differences arise because the spatial patterns of the forced response and internal variability in the CMIP5 models are not orthogonal; it is, therefore, appropriate to subtract the forced signal at each grid point prior to computing the correlation maps to faithfully characterize the spatial pattern of ICV in CMIP5 historical simulations.

The differences between the ICV spatial patterns estimated using the MMEM1 and KC2017 methods are most pronounced for the 40-yr low-pass-filtered data and taper off as the smoothing time scale decreases, essentially disappearing in the raw (unfiltered) data (Fig. S6 in the supplemental material). This is to be expected, as the maps for the unfiltered data are dominated by correlations between high-frequency components of ICV in the gridded SST and each climate index considered, whereas the estimated forced signals are much smoother irrespective of the estimation method. The differences between the patterns computed using all methods are also alleviated in the multimodel aggregates of correlation maps, even for the 40-yr low-pass-filtered indices (see Fig. 5 for the AMO pattern).

The latter property becomes most apparent in the analysis of similarity between the (AMO, PMO, and NMO) spatial patterns of ICV estimated by C2017’s MMEM1 method and the KC2017 method (Fig. 6): the patterns of individual models are generally much more different than the multimodel ensemble’s aggregate patterns. Hence, characterizing the differences between the MMEM1- and KC2017-based estimates of ICV patterns using the multimodel mean correlation maps, without reference to model uncertainty, would be misleading, much in the same way as using the MMEM as the single universal forced signal estimate for the entire CMIP5 ensemble is misleading without consideration of the differences in the forced signals between individual CMIP5 models (cf. Steinman et al. 2015a; Kravtsov and Callicutt 2017).

## 6. Conclusions

We compared the performance of four methods for estimating internal climate variability (ICV) in historical simulations of CMIP5 climate models. The methods considered included a suite of methods based on subtraction of scaled multimodel ensemble means (MMEM) from individual model simulations or observations (the MMEM1, MMEM2, and MMEM3 methods; Steinman et al. 2015a,b, Frankcombe et al. 2015; C2017), as well as the KC2017 method (Kravtsov et al. 2015; Kravtsov and Callicutt 2017; Kravtsov 2017), which works with smoothed ensemble means of individual models. We found the following:

The KC2017 method provides the best estimates of the low-frequency (10-yr+) ICV in historical simulations of individual models, and results in independent (uncorrelated) time series of ICV in individual simulations of a given model. The magnitude of ICV so computed is consistent with that of the variability in the PI control runs.

The performance of MMEM-based methods depends on whether the scaled MMEM signal is close to this model’s true forced signal or not. In the latter case, the estimated ICV is contaminated, and often dominated, by the difference between the true forced signal and its MMEM-based estimate, resulting in large phase and amplitude errors of the ICV so inferred. The resulting estimates of the ICV in individual simulations of a given model are not independent (correlated). For MMEM1 method, the magnitude of the estimated ICV is larger than that in the PI control runs.

Out of the 17 CMIP5 models considered here, only six models are characterized by the forced signals well represented by the MMEM methods. For the rest of the models, the differences between the scaled MMEMs and their true forced signal estimates obtained by the KC2017 method are large, and the resulting MMEM-based estimates of the ICV are strongly biased as per item 2 above. Hence, the ICV in the ensemble of historical CMIP5 simulations cannot be reliably determined using MMEM-based methods.

The spatial patterns (correlation maps) of ICV estimated using the MMEM1 and KC2017 methods in individual models are, in general, very different. These differences are muted in the multimodel ensemble-mean correlation maps.

Out of the three MMEM-based methods applied to the CMIP5 ensemble, the three-factor scaling method (MMEM3) performs, on average, better than the one- and two-factor methods (MMEM1 and MMEM2, respectively) but is still characterized by large errors relative to the KC2017 method.

All methods indicate that the estimated observed ICV at decadal and longer time scales has a much larger magnitude than the CMIP5-simulated ICV. The disparity between the observed and simulated ICV is most pronounced in the KC2017 estimates of this variability.

Our results demonstrate that application of the MMEM1 method to infer ICV in historical CMIP5 simulations in C2017 work is inadequate, and leads to large phase and amplitude errors of ICV so estimated, because of contamination of the true simulated ICV by forced-signal biases of individual models with respect to the MMEM.

## Acknowledgments

We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. This research was supported by the NSF Grant AGS-1408897. All data and MATLAB scripts for this paper are available for downloading using the link listed in the supplemental material.

## REFERENCES

## Footnotes

Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-17-0438.s1.

The original article that was the subject of this comment/reply can be found at http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-16-0712.1.

© 2017 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).