## Abstract

The problem of separating variations due to natural and anthropogenic forcing from those due to unforced internal dynamics during the twentieth century is addressed using state-of-the-art climate simulations and observations. An unforced internal component that varies on multidecadal time scales is identified by a new statistical method that maximizes integral time scale. This component, called the internal multidecadal pattern (IMP), is stochastic and hence does not contribute to trends on long time scales; however, it can contribute significantly to short-term trends. Observational estimates indicate that the trend in the spatially averaged “well observed” sea surface temperature (SST) due to the forced component has an approximately constant value of 0.1 K decade^{−1}, while the IMP can contribute about ±0.08 K decade^{−1} for a 30-yr trend. The warming and cooling of the IMP matches that of the Atlantic multidecadal oscillation and is of sufficient amplitude to explain the acceleration in warming during 1977–2008 as compared to 1946–77, despite the forced component increasing at the same rate during these two periods. The amplitude and time scale of the IMP are such that its contribution to the trend dominates that of the forced component on time scales shorter than 16 yr, implying that the lack of warming trend during the past 10 yr is not statistically significant. Furthermore, since the IMP varies naturally on multidecadal time scales, it is potentially predictable on decadal time scales, providing a scientific rationale for decadal predictions. While the IMP can contribute significantly to trends for periods of 30 yr or shorter, it cannot account for the 0.8°C warming that has been observed in the twentieth-century spatially averaged SST.

## 1. Introduction

It is well established that the global mean surface temperature has risen by more than 0.7°C over the last 100 yr (Trenberth et al. 2007). Since global warming has been linked to rising sea levels (Bindoff et al. 2007), glacier melting (Lemke et al. 2007), Arctic sea ice retreat (Lemke et al. 2007), increasing tropical cyclone intensity (Knutson et al. 2010), and diminished snow cover (Lemke et al. 2007), the cause of this warming is of obvious concern. Many studies conclude that human activities are primarily responsible for this warming (Hegerl et al. 2007). However, the observed warming does not occur uniformly in time. For instance, the rate of global warming during 1901–2005 is about 0.075°C decade^{−1}, whereas the rate during 1981–2005 is about 0.23°C decade^{−1} (Trenberth et al. 2007). Also, observations indicate little to no warming during 1950–70 and 1998–2007, despite increasing greenhouse gas concentrations (Trenberth et al. 2007).

While the reality of human-induced global warming is beyond doubt, a question of intense interest is whether the recent acceleration in warming and the midcentury cooling are due to internal variability or changes in natural and anthropogenic forcing (including volcanic and solar forcing). By internal variability we mean variability that occurs in the absence of natural or anthropogenic forcing; that is, variability that occurs solely due to the internal dynamics of the coupled atmosphere–ocean–biosphere–cryosphere system. The purpose of this paper is to quantify the degree to which observed multidecadal fluctuations of spatially averaged sea surface temperature during the past century can be separated into distinct internal and forced components. While this topic has been the focus of numerous studies (Zwiers and Zhang 2003; Huntingford et al. 2006; Stone et al. 2007; Hegerl et al. 2007), the present work explicitly identifies a significant unforced multidecadal component and separates this component from forced components using optimal spatial filtering techniques.

Detection of climate change and its attribution to external forcings requires first defining the space–time structure of the expected response of the climate system to external forcing. These forced response patterns typically are obtained from coupled atmosphere–ocean general circulation models. Because realistic climate models generate their own internal variability, identification of the forced response pattern from model simulations involves yet another signal detection problem. A typical approach is to estimate the response pattern by averaging over space, time, ensembles, or by calculating leading principal components (PCs) of forced simulations. However, none of these approaches optimizes detectability. In this paper, discriminant analysis is used to construct a forced response pattern that maximizes the ratio of forced variance to internal variability. A forced pattern estimated this way optimizes detection in the forced climate models and hence is likely to be detectable in observations.

Another step in detection and attribution analysis is the determination of the statistical properties of internal variability. Much of the debate on global warming centers on uncertainties in the structure and magnitude of the internal variability of the real climate system. In practice, these statistical properties are estimated from climate simulations without natural or anthropogenic forcing, called control runs. After defining the forced response pattern and the statistical characteristics of internal variability, both to within unknown coefficients, generalized multivariate linear regression is then used to estimate the coefficients so as to best fit the observed record. A forced response is detected when the change in coefficient is unlikely to have occurred because of natural variability, and is attributed when the change is consistent with the climate model predictions for that response and is inconsistent with the predicted response to other plausible forcings (Hasselmann 1979, 1997; Allen and Tett 1999).

In this paper, we expand the standard detection and attribution framework by including a pattern of internal variability among the forced response patterns being investigated. This approach allows a more complete diagnosis of the role of unforced components in observed variability. The value of this approach depends on how well the internal pattern explains the variability in question. Estimation of the amplitude of internal variability proceeds in the same fashion as in standard detection analysis. Moreover, the ability to distinguish between forced and internal variability depends on the extent to which these patterns differ. However, detection and attribution are not relevant concepts for components that arise from internal variability. Instead, the concepts of skill and fidelity become relevant for internal components: skill measures the degree to which predictions of a component match observations of the component and fidelity measures the degree to which the observed statistical properties of a component match those predicted by climate models. Methods for estimating skill and fidelity are discussed in Jolliffe and Stephenson (2003) and DelSole and Shukla (2010).

## 2. Identification of internal multidecadal patterns

We are interested in diagnosing internal variability on decadal-to-multidecadal time scales. Unfortunately, standard statistical procedures, such as principal component analysis, do not decompose variables specifically by time scale. Empirical mode decomposition (Huang and Wu 2008) and singular spectrum analysis (Ghil et al. 2002) ignore spatial correlations and hence are not optimal. Multichannel singular spectrum analysis and extended empirical orthogonal functions (Ghil et al. 2002) are often used to decompose time series by time scale, but they are not specifically optimized for this purpose. Here we employ a novel statistical method that decomposes variables by time scale, where time scale is measured by average predictability time (APT). APT is a measure of predictability that can be interpreted as a multivariate generalization of the integral time scale, *T*_{2},

where *ρ _{τ}* is the autocorrelation function of the process and

*τ*is time lag (DelSole and Tippett 2009a). DelSole and Tippett (2009b; see their appendix C) show that any multivariate time series can be decomposed into an uncorrelated set of components ordered such that the first maximizes APT, the second maximizes APT subject to being uncorrelated with the first, and so on.

Unfortunately, multidecadal variability tends to be model dependent (Latif et al. 2006). To reduce this model dependence, we adopt a multimodel approach in which APT is optimized over multiple models. A potential problem with this approach is that the leading component may arise from a single dominant model or subset of models. To confirm that the leading APT component genuinely reflects a property of the entire multimodel ensemble, we verify that the component is predictable on decadal time scales in each model separately.

We examine control runs that were assessed in the Intergovernmental Panel on Climate Change (IPCC) Fourth Assessment Report [also known as Coupled Model Intercomparison Project phase 3 (CMIP3)]. Only runs that are at least 300 yr long and have consistent variances are considered (see the second section of the appendix for details). Using only control runs ensures that the obtained patterns are due to internal variability and not to natural or anthropogenic forcing. The component with maximum APT in these control runs is shown in the top panel of Fig. 1. To facilitate comparison with observations, APT is optimized only using “well observed” ocean model points, where the term *well observed* depends on the sampling characteristics of an observational dataset (see the first section of the appendix for details). The pattern is predominantly of single sign and concentrated in the North Atlantic and North Pacific. The centered time series in three representative control runs, shown in the bottom panel of Fig. 1, confirm that the component fluctuates significantly on multidecadal time scales. The APT for this component is 5.2 yr. Note that APT quantifies predictability time scale but not oscillatory time scale. Loosely speaking, oscillatory time scale depends on the *location* of a spectral peak, while APT depends on the *width* of a spectral peak (DelSole and Tippett 2009a). Statistical significance of APT is assessed relative to the null hypothesis that the time series is white noise when sampled every 2 yr. The motivation for this hypothesis and the method for estimating the corresponding sampling distribution are discussed in the third section of the appendix. The APT of this component is found to be statistically significant in each model individually. We refer to this component as the internal multidecadal pattern (IMP).

The space–time structure of the IMP is suggestive of the multidecadal variability identified in previous studies (Bjerknes 1964; Schlesinger and Ramankutty 1994; Kushnir 1994; Delworth and Mann 2000). Consistent with some of these studies, the IMP identified here has no significant correlation with concurrent atmospheric surface winds, surface pressure, or precipitation at any grid cell. The lack of correlation between major atmospheric coupling variables, and the concentration of amplitude in regions associated with deep-water formation, suggests that the multidecadal variations in the IMP arise from internal ocean dynamics, either as a self-sustained phenomenon or driven stochastically by the atmosphere.

Although the IMP has amplitude in two oceanic basins, the optimization procedure does not distinguish between cause and response, hence we cannot exclude the possibility that decadal variability arises in one basin and that the signal in other basins emerges as a response. A related concern is that optimizing multimodel APT may result in a component that is a mixture of distinct phenomena in different models. For instance, one could imagine the Atlantic and Pacific structure of the IMP as resulting from some of the models having predictability in the Atlantic and others in the Pacific, but with none having related predictability in both. Finally, the signal in both basins might be a statistical artifact caused by representing variability with an incomplete set of EOFs. To gain insight into these questions, we optimized APT only in the Atlantic basin and then calculated regression coefficients between the component and the SST at individual grid points. We found that optimizing APT only in the Atlantic yielded nearly the same structure in the Atlantic as seen in Fig. 1. Moreover, outside the Atlantic, the regression map captured a similar positive relation in the North Pacific. Large-scale positive regression coefficients can be found in the Pacific, to varying degrees, in about half the models individually. We also optimized APT only in the Pacific and obtained a pattern of the same sign globally, with positive regression coefficients in the Atlantic in about half the models individually. These calculations show that the Atlantic and Pacific signals are genuine covarying structures and not statistical artifacts of the technique.

## 3. Identification of the forced response

To specify the response to climate forcing, we seek a pattern that describes the change in spatial structure due to natural and anthropogenic forcing, but also filters out as much internal variability as possible. To do this, we determine the pattern that maximizes the ratio of variance in the forced run to variance in the control run (details given in the fourth section of the appendix). The mathematical technique for doing this is called discriminant analysis and has been used previously to quantify seasonal predictability (Straus et al. 2003), decadal predictability (Venzke et al. 1999), and climate change (Ting et al. 2009; Schneider and Held 2001). Our application differs from Ting et al. (2009) in that we use the method to discriminate between forced and control runs, with no ensemble averaging, whereas Ting et al. (2009) use only forced runs to discriminate between ensemble means and their deviations. To the extent that the forced response is additive and independent of internal variability, the two methods should give identical results. In practice, the two methods may give different results. Our approach takes advantage of the much larger sample sizes afforded by the control runs.

In contrast to many attribution studies, no time lag information is used to describe the response to external forcing—the discriminant analysis is based only on spatial structure.

The pattern that maximizes the ratio of variances between forced and control runs is shown in Fig. 2. Variance in each control run is measured with respect to the 300-yr mean of the control run, while variance in each forced run is measured with respect to the 1901–50 mean of the forced run. The discriminant pattern in Fig. 2 is similar to that obtained in Ting et al. (2009), indicating that sampling errors are not significant. We call this pattern the forced-to-unforced discriminant. In contrast to the IMP (Fig. 1), the discriminant pattern has positive anomalies in the tropics and weak or negative anomalies in the extratropics. These differences provide the basis for separating forced and internal variability.

Discriminant analysis produces an ordered set of patterns, such that the first maximizes the ratio of forced-to-unforced variance, the second maximizes this ratio subject to being uncorrelated with the first, and so on. The question arises as to whether some secondary patterns should be included to capture more fully the response to climate forcing. Figure 3 shows the variance ratios for all discriminant patterns and reveals that only the first is clearly separated from the others. The figure also shows the upper and lower fifth percentiles of the variance ratios determined from 1000 bootstrap samples of the control simulations, using a block length of 80 yr (a large block length was chosen to capture autocorrelations and trends). The fact that only the first ratio lies well outside the bootstrap confidence interval implies that the other ratios are consistent with the hypothesis of no forced response. Presumably, only one forced pattern emerges because the spatial response to different climate forcings (e.g., fossil fuel burning, volcanic eruptions, and solar variability) tends to project on similar surface patterns. Consistent with this, most attribution studies distinguish different forcings by including—in addition to horizontal spatial information—temporal information, vertical structure, or seasonality in the response pattern (Hegerl et al. 2007). While ignoring temporal information limits our ability to distinguish different forcings, it also creates opportunities for performing detection and attribution on finer temporal scales. In particular, signals with both space and time information can be monitored only if the data are at least as long as the time interval used to define the signal. In contrast, signals with only spatial information allow one to apply detection and attribution each year at a time, to monitor the signals in real time, and to detect abrupt changes in real time.

## 4. Results of fingerprinting

We now fit the well-observed annual average SST at each year to a linear combination of the forced-to-unforced discriminant and the IMP. Amplitudes are chosen to approximate observations as closely as possible, where “close” is defined by a generalized distance measure that accounts for correlations in space. This procedure is equivalent to fingerprinting (Hasselmann 1979, 1997; Allen and Tett 1999; Hegerl et al. 2007) and is discussed in the fifth section of the appendix. In contrast to previous studies that have used fingerprinting to distinguish between different forcings (e.g., anthropogenic and natural), fingerprinting is used here to distinguish between forced and internal variability, and this discrimination is based solely on spatial structure (i.e., no temporal information is included in the response pattern). The value of including an unforced component among the forced components lies in the fact that the IMP is 1) the most predictable structure on decadal time scales in state-of-the-art climate models and 2) of single sign and hence projects strongly on the global average. For brevity, we refer to the amplitude of the IMP projected onto the well-observed SST as the “observed IMP.”

The amplitude of the forced component, expressed as a 95% confidence interval, is indicated by shading in the top panel of Fig. 4. The confidence interval accounts for uncertainty due to finite sample size and missing observations (see the sixth section of the appendix for details). The amplitude of the forced component is dominated by a secular trend, but it also decreases briefly after certain major volcanic eruptions. These decreases are consistent with the fact that explosive volcanic eruptions increase sulfate aerosols in the stratosphere, which in turn lead to temporary global cooling (Forster et al. 2007). These evolutionary features support the claim that this component captures the response to both anthropogenic and natural forcing.

The expected amplitude of the forced pattern is estimated by averaging the amplitude of the forced-to-unforced discriminant across the forced runs. The resulting amplitude, shown as the blue line in the top panel of Fig. 4, is smoother than the observed counterpart because fluctuations arising from internal variability are filtered out by averaging over many forced runs. The question arises as to whether the observed and predicted amplitudes agree with each other, as required for a formal attribution analysis. On a year-by-year basis, an attribution analysis is equivalent to checking that the predicted amplitude occurs within the 95% confidence interval of the observed amplitude—that is, checking that the blue curve in Fig. 4 lies within the shaded region. This condition is satisfied for most years (by definition, observations are expected to fall outside the shaded region approximately 5% of the time). In contrast, detection—that is, determining that the observed amplitude is significantly different from zero—is equivalent to checking that the 95% confidence interval for the observed amplitude does not contain zero—that is, checking that the shaded region does not intersect the zero line. This condition is satisfied for every year after 1968. Thus, we conclude that the warming that has been observed since the 1970s is very unlikely to have occurred because of internal variability and is consistent with the warming because of anthropogenic and natural forcing predicted by models, which is consistent with the major conclusions of the IPCC (Hegerl et al. 2007). Note that our methodology has allowed us to draw this conclusion without taking multiyear averages, as typically performed in detection and attribution analyses (e.g., Huntingford et al. 2006; Zwiers and Zhang 2003; Stone et al. 2007).

The amplitude of the IMP, expressed as a 68% confidence interval, is indicated by shading in the bottom panel of Fig. 4. A smaller confidence interval is used because we are comparing amplitudes to each other, not to a specific value (e.g., an approximate 5% significance test for a difference in random variables would involve checking that the difference exceeds twice the standard deviation, or equivalently that one standard deviation “error bars” for the two values do not overlap). A striking aspect of this time series is the multidecadal oscillations in the last 100 yr. The negative anomalies during 1900–20 and 1970–90, and the positive anomalies during 1930–60, closely follow the multidecadal variability identified in the Atlantic Ocean by previous studies (Schlesinger and Ramankutty 1994; Kushnir 1994). Also, the observed IMP is strongly correlated with the annual average Atlantic multidecadal oscillation (AMO) index, shown as the red curve in the bottom panel of Fig. 4. These results suggest that the AMO represents variability that is dominated by internal dynamics, as suggested in previous studies (Knight et al. 2005; Zhang et al. 2007; Ting et al. 2009).

Since neither the control nor the forced runs used initial states based on observations, the random nature of internal variability means that we do not expect an internal component in the model to predict the time evolution of the component in observations—that is, we do not expect predictions of an internal component by these models to have skill. Consequently, comparison between model and observations is confined to verifying fidelity—that is, to verifying that the statistical characteristics of the component are consistent between observations and models. Two measures of fidelity are time scale and variance. To quantify time scale, we use a sample estimate of the integral time scale (1) (see the third section of the appendix for further details). For the observed IMP time series, we find *T*_{2} = 5.7 yr. By comparison, the mean and standard deviation of *T*_{2} across all forced runs are 6.6 and 3.7 yr, respectively. Thus, the time scale of the observed IMP is within the range of time scales predicted by models. Similarly, the variance of the observed IMP is 1.65, while the mean and standard deviation of the IMP variance across all forced runs are 1 and 0.75, respectively. Thus, the variance of the observed IMP is consistent with the range of variances predicted by models.

The squared autocorrelation function of the IMP in each control run is shown in Fig. 5. Also shown is the 5% significance level of the autocorrelation based on a sample size of 300. The significance level does not account for the selection bias because of choosing the component that maximizes APT; however, this bias is small owing to the large sample size of the multimodel dataset (over 4200 samples) and as confirmed by splitting the data in half and comparing the autocorrelations to those calculated from independent control runs. The *e*-folding times have a mean of 7.7 yr and a standard deviation of 3.5 yr; however, several models have significant autocorrelations after 10 yr, implying that the IMP in these models can be predicted by at least a linear model on decadal time scales. These results provide a scientific rationale for decadal prediction.

## 5. Forcing of internal multidecadal variability

Two major questions arise at this point: 1) does fingerprinting truly separate forced and unforced variability in observations and 2) does natural and anthropogenic forcing of the twentieth century influence the evolution of the IMP? Both questions can be answered by checking for the existence of a common signal in the IMP time series of the forced runs. To detect this signal, we test whether the ensemble mean IMP varies in time. For consistency, the IMP in the forced runs is calculated the same way as the observed IMP, including using the same forced-to-unforced discriminant. The mean and 95% confidence intervals for the IMP are shown in Fig. 6 (see the seventh section of the appendix for calculation details). About one-fifth of the confidence intervals fail to bracket zero, whereas only one-twentieth should fail to bracket zero if there were no signal. Thus, we cannot conclude that the forcing does not influence the IMP. Nevertheless, there are three points to recognize about this conclusion. First, the ensemble mean IMP has a standard deviation of 0.24, whereas the observed IMP has a standard deviation of 1.65. Thus, to the extent forcing influences the IMP, this influence explains only about one-seventh of the observed variability in IMP. Second, the ensemble mean IMP generally decreases between 1980 and 2000, whereas the observed IMP generally increases during this period. Also, the ensemble mean IMP has virtually no trend between 1950 and 1970, whereas the observed IMP generally decreases during this period. Thus, even if the forcing influenced the IMP, this influence cannot account for the general increase in the observed IMP during 1980–2000 and the general decrease between 1950 and 1970. Finally, the ensemble mean IMP is autocorrelated, hence it is not appropriate to apply the significance test for the mean to each year independently. An effective time scale estimated by fitting the *ensemble mean* IMP to a first-order autoregressive (AR) model is 7.7 yr (see the seventh section of the appendix for details of this estimate). To the extent that this time scale reduces the degrees of freedom by a factor of 7.7, the error bars in Fig. 6 would more than double, in which case the ensemble mean IMP in the forced runs would be indistinguishable from zero.

## 6. Analysis of spatially averaged SST

We now investigate the role of forced and internal variability in the spatially averaged SST. To do this, we consider three annual average SST datasets: 1) the observed SST, 2) the SST as reconstructed from the IMP and the forced component, and 3) the SST as reconstructed from the forced component only. The latter two datasets are sampled consistently with the missing data distribution of the first. The area-weighted spatial average of each SST dataset is shown as colored dots in Fig. 7.

First note an apparent discontinuity at 1945. This discontinuity has been noted previously and compellingly attributed to uncorrected instrumental biases during the early 1940s (Barnett 1984; Thompson et al. 2008). Consequently, we exclude the early 1940s data from our analysis. Dividing the post-1945 era into two equal periods yields the two 32-yr periods 1946–77 and 1977–2008. The trends during these 32-yr periods for the three datasets are shown as solid lines in Fig. 7 and tabulated in Table 1; the trend for the whole 63-yr period 1946–2008 is also shown, but it is offset by 0.4 K for clarity; 95% confidence intervals are also include in the table.

Second, the trend for the observed SST and for the reconstruction based on IMP plus forced component are statistically indistinguishable (i.e., their confidence intervals overlap). This agreement demonstrates that the two components capture the dominant multidecadal fluctuations in the observed spatially average SST.

Third, the trend for the observed SST is larger in the 1977–2008 period than in the 1946–77 period, indicating “accelerated warming.” The change in trend is significant under the assumption that the variability that remains after removing the trend is not autocorrelated. In reality, the residual variability is autocorrelated because of the IMP and nonlinear temporal forcing, raising the possibility that the two trends might not be distinguishable when autocorrelations in the residuals are taken into account. In fact, this difference can be explained by the IMP. Specifically, the trend due to only the forced component (i.e., the red line) is statistically the same in the two 32-yr periods and in the 63-yr period; that is, the forced part is not accelerating. Taken together, these results imply that the observed trend differs between the periods 1946–77 and 1977–2008, not because the forced response accelerated, but because internal variability leads to relative cooling in the earlier period and relative warming in the later period.

The contribution to trends due to the IMP can be understood from a more general framework. Since the IMP is stochastic, it can contribute only random trends. The distribution of these trends can be derived analytically from the statistics of the stochastic process (see the distribution of trends for a stochastic process in the appendix for details). If the IMP is fitted to a first-order autoregressive model based on its 1-yr lag autocorrelation of 0.806, then the 95% confidence interval for the IMP varies with trend period length as shown in Fig. 8. Note that the confidence interval increases rapidly with decreasing trend period length. For reference, the confidence interval is ±0.169 K decade^{−1} for 16-yr trends, ±0.0776 K decade^{−1} for 32-yr trends, and ±0.031 for 64-yr trends. By comparison, the trend for the forced pattern is about 0.1 K decade^{−1}, which is close to the confidence interval for the IMP trend for 25-yr periods. On 10-yr time scales, variability in trend due to the IMP is relatively large (e.g., ±0.265 K decade^{−1}) and can easily overwhelm the trend due to the forced component, although variability due to interannual variability, such as El Niño, also becomes important on this time scale. This framework provides a consistent and plausible explanation for why trends vary so strongly on 10-yr time scales and indicates that the lack of warming trend during the past 11 yr (1998–2008) is not sufficient to conclude that the long-term rate of warming has changed.

Consistent with the above results, all three datasets shown in Fig. 7 have statistically indistinguishable trends for the 63-yr period 1946–2008, indicating that internal variability can be filtered out by fitting trends over 60 or more years. Thus, in addition to optimizing forced-to-unforced variance, the forced response can be estimated from the

pattern of linear trends in the observations between 1850 and 2005,

pattern of linear trends in the forced runs between 1850 and 2000,

leading EOF of the ensemble mean forced runs between 1850 and 2000, and

leading signal-to-noise discriminant of the forced runs between 1850 and 2000.

The last pattern, proposed by Ting et al. (2009), maximizes the ratio of the variance of ensemble means to the variance of the their deviations. All four patterns are shown in Fig. 9. The patterns generally agree on a cooling pattern in the North Atlantic, warming in the tropics, and little to no warming in the central North Pacific. The result of fitting the observed SST to each of these patterns in turn, simultaneously including the IMP, is shown in Fig. 10. We see that the different time series differ only in minor ways on short time scales; that is, the time series and confidence intervals are not sensitive to the method by which the forced pattern has been estimated. This result shows that using discriminant analysis to identify forced patterns does not enhance detectability, contrary to expectation. However, the use of spatial optimization techniques may become more critical for smaller geographic domains or for smaller sample sizes. Note also that the trend pattern from observations does not involve a dynamical model and hence is not affected by uncertainties in climate forcing.

It should be emphasized that while the IMP can contribute significantly to trends on time scales of around 30 yr or shorter, it cannot account for the century-long 0.8°C warming trend observed in the spatially averaged sea surface temperature.

## Acknowledgments

We thank the anonymous reviewers for their thoughtful and extensive comments, which lead to significant improvements in the methodology and presentation of this work. This research was supported by the National Science Foundation (Grants ATM0332910, ATM0830062, and ATM0830068), National Aeronautics and Space Administration (Grants NNG04GG46G and NNX09AN50G), the National Oceanic and Atmospheric Administration (Grants NA04OAR4310034, NA09OAR4310058, and NA05OAR4311004), and the U.S. Department of Energy. The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.

## REFERENCES

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**(

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

**,**

### APPENDIX

#### Calculation Details

##### a. Datasets

The observational dataset used in this study is the Met Office Hadley Centre Sea Surface Temperature, version 2 (HadSST2; available online at http://badc.nerc.ac.uk/data/hadsst2/) dataset. This dataset is an estimate of SST from 1850 to the present, averaged on a monthly basis on a 5° × 5° grid. Grid cells with insufficient data are assigned “missing values.” See Rayner et al. (2006) for further details of this dataset.

The model simulations analyzed in this study are the World Climate Research Programme (WCRP) CMIP3 multimodel dataset. These simulations were assessed in the Fourth Assessment Report of the IPCC. The two scenarios analyzed are 1) radiative forcing characteristic of preindustrial times, called “control runs” or “unforced runs” (officially designated as “PICNTRL”) and 2) anthropogenic and natural forcing characteristic of the twentieth century, called “forced runs” [officially designated as Twentieth-Century Climate in Coupled Model (“20c3m”)]. The specific models used in this study are listed in Table 2. Some models did not include solar and volcanic forcing in the forced runs, as indicated in Table 2, so a multimodel average of the forced runs probably underestimates the response to natural forcing. The consistency of the final results with observations suggests that this underestimation is not serious.

The output of each simulation was interpolated onto the 5° × 5° grid of the HadSST2 dataset to facilitate comparisons on a common grid. The time series at each ocean grid cell in each dataset was averaged from January to December of each year to obtain 12-month averages. For the HadSST2 dataset, only years having 10 or more months of nonmissing values were averaged; otherwise, the grid cell was assigned a missing value for that year.

To facilitate comparison with observations, we include a grid cell in the analysis only if 85% of the years between 1950 and 2005 were available in the HadSST2 dataset; otherwise, the grid cell was omitted from analysis. This “masking” procedure yielded a map containing 688 ocean points. All model simulations were masked in this way for consistency.

The AMO index used in Fig. 4 was downloaded (from http://www.cdc.noaa.gov/data/timeseries/AMO/). The AMO is defined as the detrended, area-weighted-average SST over the North Atlantic from 0° to 70°N. The index was annually averaged and scaled to fit in the figure.

##### b. Multimodel EOFs

Statistical optimization methods produce biased estimates because of overfitting when the number of estimated parameters is not a small fraction of the total sample size. To reduce the number of estimated parameters, we project variables onto the leading PCs of the ensemble of control runs. Accordingly, we collected all control runs with at least 300 yr of simulation, which totaled 17 models. To avoid biasing the results toward models with longer runs or multiple ensemble members, we use only the last 300 yr of the single longest control run of each climate model simulation. The 300-yr mean of each simulation is then subtracted from each grid point of the corresponding control run to produce anomalies of annual mean fields. For reasons explained below, the Institute of Atmospheric Physics (IAP) model, Goddard Institute for Space Studies Atmosphere Model E-H (GISS-EH), and GISS Model E-R (GISS-ER) were removed from analysis. This procedure yielded 300-yr anomaly time series from 14 separate climate models, giving a total of 4200 yr of simulation. The specific models used in our analysis are the first 14 models listed in Table 2.

The complete set of control runs obtained from the above procedure yields a 688 × 4200 data matrix. The principal components were computed by multiplying the anomaly at each grid point by the square root of the cosine of latitude, computing the singular value decomposition (SVD) of the resulting data matrix, and then inverting the cosine weighting in the spatial fields. The cosine weighting ensures that the SVD maximizes the area-weighted variance.

After computing the multimodel PC, the variance of a given PC was computed separately for each model. A standard chi-square test with 299 (=300 − 1) degrees of freedom indicates that the 95% confidence interval for each variance is about 18% of the variance. This interval underestimates the true interval since the time series are autocorrelated; nevertheless, it provides a useful benchmark. When all 17 control simulations of length 300 yr are included, the IAP model was found to have several times more variance than any other model in each of the first five PCs, suggesting that this model has significantly different space–time variability than other models and hence should not be pooled with the others. Accordingly, the IAP model was dropped from the analysis and the PCs recomputed. The newly computed PCs revealed that GISS-EH had more than twice as much variance and that GISS-ER had less than half as much variance as any other models. Eliminating these two models and recomputing the PCs revealed that the Centre National de Recherches Météorologiques (CNRM) model had 40% more variance than any other model, but otherwise the remaining control runs revealed no clear outlier models in terms of variance (i.e., the 95% confidence interval for each model overlapped with at least one other model). Given the underestimation of the confidence interval, and the multiple comparisons involved, we decided not to exclude the CNRM model from our final set of models. This procedure eliminated 3 out of 17 models, giving a total multimodel ensemble of 14 models, each of length 300 yr.

It turns out that the IAP model and GISS-EH and GISS-ER also have significant trends in the control runs. APT analysis is sensitive to such trends and including these models in the analysis produced results that were dominated by these models. Thus, these models are “outliers,” not only in terms of variance, but also in their multidecadal variability.

##### c. IMP

We employ a novel procedure for identifying characteristic patterns of internal multidecadal variability. Complete details can be found in DelSole and Tippett (2009a,b). Briefly, the method is to optimize APT, which is defined as the integral over lead time of the “signal to total” ratio of a forecast model, where *signal* is the variance of the ensemble mean at fixed lead time, and *total* is the corresponding total variance of the forecast ensembles. For a multivariate, stationary, Gaussian, Markov process, maximizing APT leads to the generalized eigenvalue problem

where **q** is the desired projection vector, **Σ**_{τ} is the time-lagged covariance matrix of the process, *τ* is the time lag, and superscript “T” denotes the transpose matrix. The eigenvectors provide the basis for decomposing the multivariate time series into a complete, uncorrelated set of components ordered such that the first maximizes APT, the second maximizes APT subject to being uncorrelated with the first, and so on.

In practice, the data are first projected onto the leading principal components of the control runs [the inverse projection is somewhat subtle and discussed in Schneider and Held (2001) and DelSole and Tippett (2007)]. The results are virtually independent of the number of PCs in the range 10–100 PCs, presumably because the time series are relatively long. We choose 40 EOFs for displaying results, which is less than 1% the number of samples.

The time-lagged covariances become less certain as the time lag increases, since the amount of data available for averaging decreases with time lag. To produce more stable estimates, sample covariances of different control runs were averaged together. In addition, following DelSole and Tippett (2009a), we truncate the sum in (A1) to 20 yr and apply a Parzen window to the time-lagged covariances. The results are not sensitive to the truncation level as long as it is a small fraction of the total length of 4200.

To quantify time scale, we use the sample estimate of the integral time scale (1)

where *ρ̂ _{τ}* is the sample autocorrelation function of the time series,

*τ*is time lag (in years), and

*w*are coefficients for the Parzen window. The sample integral time scale of the first 15 components in each control run is shown in Fig. A1. The

_{τ}*T*

_{2}value of the leading component in each forced run is tabulated in Table 2. For observation-based estimates, only results after 1900 are used—uncertainties prior to 1900 are deemed too large to allow useful estimates (though the results are not very different if data prior to 1900 are used).

The horizontal line near the bottom of Fig. 11 shows the 5% significance level, which was estimated by Monte Carlo methods as follows. An appropriate null hypothesis for our test is that the data are drawn from a white noise process when sampled every 2 yr. It would be inappropriate to assume white noise for annual sampling because the ocean surface is highly correlated on monthly time scales, and this correlation translates into a correlation between annual averages. Consistent with this, assuming white noise for annual sampling leads to all components being statistically significant—that is, all components are distinguishable from white noise. Because APT is invariant to nonsingular linear transformation, the process can be assumed to be white in space without loss of generality. Accordingly, we generate a 40 × 2100 data matrix by drawing independent random numbers from a normal distribution with zero mean and unit variance. Components that maximize APT were then determined. For each component, a corresponding 2100-yr time series was derived. The integral time scale in each 150-yr chunk was determined for each component, yielding 14 *T*_{2} values per component. This procedure was repeated 100 times to generate 14 × 40 × 100 *T*_{2} values. The upper five percentile of the 14 × 100 *T*_{2} values of each of the 40 components was then determined. The leading 15 values are plotted in Fig. 11 as the horizontal line. Note that the *T*_{2} threshold values were doubled since the time step in the Monte Carlo method was 2 yr. Values below this line can be interpreted as statistically indistinguishable from white noise (when sampled every 2 yr). The figure shows that the first six components have statistically significant time scales. The leading component, called the IMP, has statistically significant *T*_{2} values in all control runs.

##### d. Forced-to-unforced discriminants

The expected pattern of response to climate forcing is determined by discriminant analysis. This method assumes that the variability in the forced run can be modeled as internal noise plus an independently varying signal (i.e., the response to climate forcing). Under this assumption, the variance due to external forcing and internal variability are additive and hence the variance in the forced runs should be larger than in the unforced runs. Therefore, we seek the pattern that maximizes the ratio of the variance in the forced runs to the variance in the unforced runs. This optimization problem is standard (Schneider and Held 2001; DelSole and Tippett 2007) and leads to the eigenvector problem

where Σ* _{f}* and Σ

*are covariance matrices for the forced and control runs, respectively, averaged over all models,*

_{c}**q**is the desired projection vector, and

*λ*is an eigenvalue giving the variance ratio. The covariance matrices describe spatial covariability only—no time lag information is used in the covariance matrices. The variance in each control run is measured with respect to the 300-yr mean of the control run, while variance in the forced run is measured with respect to the 1901–50 mean of the respective forced run. Eigenvectors are ordered by decreasing eigenvalue, in which case the first maximizes the variance ratio, the second maximizes the variance ratio subject to being uncorrelated with the first, and so on. As in APT analysis, the data are projected onto the leading 40 EOFs of the multimodel ensemble. The leading eigenvector will be called the “forced-to-unforced discriminant.”

##### e. Separating forced and unforced components in observations

To separate the forced and unforced components in observations, we represent the observed annual mean temperature anomalies *T*(*x*, *y*, *t*) as a combination of three components: the IMP pattern *p _{I}*(

*x*,

*y*), a forced pattern

*p*(

_{F}*x*,

*y*), and noise

*ε*:

where *β _{I}* and

*β*are time-varying amplitudes for the IMP and forced components, respectively, and

_{F}*ε*is a random term that varies in space and time. The amplitudes

*β*and

_{I}*β*are determined using generalized least squares, which extend ordinary least squares to the case of dependent noise. This procedure is now recognized to be equivalent to fingerprinting (Allen and Tett 1999). To do this, we rewrite (A4) in matrix form as

_{F}where 𝗧* _{t}* is an

*M*-dimensional vector giving the observed temperature anomaly at time

*t*; 𝗣

*is an*

_{t}*M*× 2 matrix, whose two columns are the patterns

*p*and

_{I}*p*; and the other terms are interpreted in an obvious manner. Although the patterns

_{F}*p*and

_{I}*p*do not change in time, their representation on the observation grid after missing values are masked out does change. If

_{F}*has a covariance matrix of*

**ε***σ*

_{ε}^{2 }

**Σ**

_{ε}, then the generalized least squares estimate of

*is*

**β**_{t}where the hat symbol denotes an estimated quantity. The estimated standard error of the *i*th element of *β** _{t}*, denoted , is

where *σ̂ _{ε}*

^{2}is a sample estimate of

*σ*

_{ε}^{2}given by

The noise covariance matrix **Σ**_{ε} is estimated by the average sample covariance matrix of the individual control runs. Although **Σ**_{ε} does not change with time, the pattern of observations with nonmissing values depends on time, and hence the covariance matrix formed by extracting the matrix elements corresponding to the observed grid points depends on time. The condition number of the noise covariance matrix was found to depend significantly on the observation network, with large condition numbers occurring in the pre-1900 era. To avoid unstable estimates, the inverse covariance matrix was approximated using only the leading 20 eigenvectors of the noise covariance matrix in each year. A technical point is that all fields need to be area weighted before computing the eigenvectors, to ensure that the pseudoinverse is taken with respect to an area-weighted norm.

The amplitudes of the forced and IMP components for 10–40 EOFs are shown in Fig. A2. As can be seen, the amplitudes are nearly independent of EOF truncation for years after 1900. The results are more sensitive for six or fewer EOFs. The greater uncertainty in the pre-1900 period is presumably due to greater missing data that occurs during this period.

##### f. Missing observations

The standard error estimate in (A7) does not depend on missing data and hence does not account for uncertainty due to missing data. To estimate this uncertainty, we adopt the following resampling procedure. First, a period with reasonably complete observations was identified. We found that the 25-yr period 1981–2005 had no more than 20 missing values in any single year out of the 688 grid cells used in the calculations. This period is accordingly identified as the “data-rich period.” Then, for each year, the amplitudes of the patterns were computed for a year in the data-rich period for two different networks, namely, the network of observations in the data-rich period and the network of observations in the year in question. The difference between these two estimates gives an estimate of the uncertainty due to the missing data. Repeating this for all 25 yr in the data-rich period gives 25 error estimates from which the variance can be estimated. This variance is then added to the variance of the regression estimate [i.e., added to the square of Eq. (A7)] to obtain an estimate of the total uncertainty variance due to both finite sample size and missing observations.

The above approach to dealing with missing observations is not claimed to be optimal. Other methods, such as the regularized imputation method proposed by Schneider (2001), may produce more accurate estimates by exploiting space–time correlations in the data.

##### g. The ensemble mean IMP and its time scale

The ensemble mean IMP shown in Fig. 6 is calculated as the average IMP of 36 forced runs from the leading 14 models listed in Table 2. The error bars are computed as ±*σ*/, where *σ* is the standard deviation of the IMP in the forced runs at each year.

To estimate the time scale of the ensemble mean IMP, we fit the ensemble mean IMP from the forced runs to the first-order autoregressive model:

where *x _{t}* is the ensemble mean IMP,

*ε*is independent random error with zero mean and constant variance, and

_{t}*k*is a constant. Standard regression techniques give the estimates

*ϕ*= 0.87 and

*k*= −0.07. Following Leith (1973), the effective time scale is computed as

where *ρ*(*τ*) = *ϕ ^{τ}* is the autocorrelation function of the process. Substituting

*ϕ*= 0.87 gives the effective time scale

*T*≈ 7.7 yr.

##### h. Distribution of trends for a stochastic process

The trend of a stationary process *z _{t}* for

*t*∈ [0,

*T*] is obtained by fitting the equation

where

The least squares estimate of the trend parameter *β* is

Since the sum of *t* − *t*_{0} over *t* ∈ [0, *T*] vanishes, the mean of *β̂* vanishes. The variance of *β̂* is

where

The first term in parentheses in (A16) is the variance of the *T*-yr trend for a white noise process with variance *σ _{z}*. The second term in parentheses is an “inflation factor,” which accounts for autocorrelation in the time series.

To estimate the variance of trends for the IMP, we fit the IMP to an autoregressive model and use the resulting model to estimate the autocorrelation *ρ _{τ}*. The autocorrelation function for the best-fit AR1 and AR2 models turns out to be virtually indistinguishable. The best-fit AR1 model of the form (A9), where

*x*is substituted for the IMP, over the period 1900–2008 is found to be

_{t}*ϕ*= 0.806. The best-fit AR1 models for 1900–45 and 1946–2008 give statistically indistinguishable results. The standard deviation of the spatially averaged IMP is

*σ*

_{IMP}= 0.0872, which was substituted into (A16). The 95% confidence interval for the trend is then ±1.96, which is plotted in Fig. 8 as a function of trend period

*T*.

## Footnotes

*Corresponding author address:* Timothy DelSole, COLA, 4041 Powder Mill Rd., Ste. 302, Calverton, MD 20705. Email: delsole@cola.iges.org