## Abstract

The decadal predictability of sea surface temperature (SST) and 2-m air temperature (T2m) in the Geophysical Fluid Dynamics Laboratory (GFDL) decadal hindcasts, which are part of the Fifth Coupled Model Intercomparison Project experiments, has been investigated using an average predictability time (APT) analysis. Comparison of retrospective forecasts initialized using the GFDL Ensemble Coupled Data Assimilation system with uninitialized historical forcing simulations using the same model allows identification of the internal multidecadal pattern (IMP) for SST and T2m. The IMP of SST is characterized by an interhemisphere dipole, with warm anomalies centered in the North Atlantic subpolar gyre region and North Pacific subpolar gyre region, and cold anomalies centered in the Antarctic Circumpolar Current region. The IMP of T2m is characterized by a general bipolar seesaw, with warm anomalies centered in Greenland and cold anomalies centered in Antarctica. The retrospective prediction skill of the initialized system, verified against independent observational datasets, indicates that the IMP of SST may be predictable up to 4 (10) yr lead time at 95% (90%) significance level, and the IMP of T2m may be predictable up to 2 (10) yr at the 95% (90%) significance level. The initialization of multidecadal variations of northward oceanic heat transport in the North Atlantic significantly improves the predictive skill of the IMP. The dominant roles of oceanic internal dynamics in decadal prediction are further elucidated by fixed-forcing experiments in which radiative forcing is returned abruptly to 1961 values. These results point toward the possibility of meaningful decadal climate outlooks using dynamical coupled models if they are appropriately initialized from a sustained climate observing system.

## 1. Introduction

Multiyear to decadal climate prediction, characterized by combined signals from external radiative forcing changes and internal climate variations, is an initiative of the fifth Coupled Model Intercomparison Project (CMIP5), which will be assessed in the Intergovernmental Panel on Climate Change’s Fifth Assessment Report (IPCC AR5) (Meehl et al. 2009; Taylor et al. 2012). Although the capability to provide meaningful decadal climate outlooks using dynamical models has yet to be firmly established, the pioneering decadal hindcast experiments using initialized coupled models appear promising (Keenlyside et al. 2008; Smith et al. 2007).

A potentially large source of multiyear to decadal predictability may come from fluctuations in the Atlantic meridional overturning circulation (AMOC) (Delworth and Mann 2000; Knight et al. 2005; Zhang and Delworth 2006). The multidecadal variations of the basin-scale North Atlantic sea surface temperature (SST), generally referred to as the Atlantic multidecadal oscillation (AMO) (Enfield et al. 2001), have been hypothesized to be associated with AMOC fluctuations (Knight et al. 2005; Solomon et al. 2011). Zhang (2008) showed that the AMO is closely linked to North Atlantic subsurface temperature in observations and the Geophysical Fluid Dynamics Laboratory Climate Model version 2.1 (GFDL CM2.1) control simulation with constant external conditions, and it was suggested that the AMO has similar predictability as subsurface temperature and AMOC on the order of 10 years in perfect model predictability experiments using CM2.1 (Msadek et al. 2010). The potential predictability of North Atlantic upper-ocean temperature *O*(10 yr) was found in the CM2.1 long control integrations using an initial-value predictability metric (Branstator et al. 2012). AMO-like SST patterns were found in the uninitialized CMIP3 simulations for the twentieth-century, twenty-first-century, and preindustrial eras, while most models tend to produce AMO of shorter time scales and less periodic than suggested by observations (Ting et al. 2011). However, the existence of the AMO-like SST patterns in realistic decadal hindcasts has not been well established, and the retrospective prediction skill of such patterns using the GFDL fully coupled ensemble initialization and decadal forecasting system has not been assessed.

The expected gain from initialized decadal hindcasts is to capture the evolution of slow internal variations (e.g., AMO) of the climate system in addition to the relatively robust response to external forcings, so it is of necessity to detect the existence of those internal variations in the initialized hindcasts. The detection involves separating natural internal variability from anthropogenic forced response (Solomon et al. 2011). Here we take the approach of comparing parallel sets of initialized decadal hindcasts and uninitialized historical forcing simulations made with the same model. If the external forcing is identical, then the difference between the two sets of hindcasts, called internal residuals here, isolates the impact of initialization. We employ a new method, called the average predictability time (APT) optimization (DelSole and Tippett 2009a,b), to identify characteristic patterns of predictable components in the internal residuals of decadal hindcasts. The method successfully identified an internal multidecadal SST pattern from the CMIP3 multimodel unforced simulations (DelSole et al. 2011).

In this study, we apply APT analysis to investigate the decadal predictability of the annual mean SST and 2-m surface temperature in the GFDL IPCC CMIP5 hindcast experiments. Our main goals are to identify the internal multidecadal patterns in the initialized decadal hindcasts and assess the prediction skill of those patterns. Details of the hindcasts, observational datasets, and methods are discussed in section 2. In section 3, the internal multidecadal patterns for SST and T2m are identified by the APT analysis, the retrospective prediction skill is assessed using observations, and roles of ocean internal dynamics in decadal prediction are investigated. Conclusions and discussions are given in section 4.

## 2. Decadal hindcasts, datasets, and methods

The decadal hindcasts were initialized by the GFDL ensemble coupled data assimilation (ECDA) system. The ECDA employs an ensemble-based filtering algorithm applied to the GFDL fully coupled climate model, CM2.1, which is one of two GFDL CMIP3 models (Delworth et al. 2006). More details of ECDA can be found in Zhang and Rosati (2010) and Zhang et al. (2007), and a comprehensive assessment of oceanic variability from the latest version of the ECDA analyzed from 1960 to 2010 can be found in Chang et al. (2012). This fully coupled model methodology was chosen to produce a balanced state between the atmosphere and ocean. The 10-member ensemble decadal hindcasts were initialized on 1 January every year from 1961 to 2011 and integrated for 10 years with temporally varying anthropogenic and natural forcing, giving a total of 5100 years of hindcasts.

For the historical forcing simulations, the 10 ensemble members using CM2.1 were integrated using temporally varying anthropogenic and natural forcing for the 1861–2020 period (Knutson et al. 2006). Note that the temporally varying anthropogenic and natural forcings between 1961 and 2020 are exactly the same for the historical forcing simulations and decadal hindcasts. To elucidate the role of initialization in predicting the internal multidecadal climate variations, we conducted a set of 10-member ensemble fixed-forcing decadal prediction experiments initialized on 1 January every five years from 1965 to 2010. In the fixed-forcing experiments, the values of the anthropogenic and natural forcings are returned to 1961 conditions, and the initial conditions are exactly the same as in the decadal hindcasts.

The mean forced response was obtained from the ensemble mean of the 10-member historical forcing simulations. The decadal hindcast anomalies for each variable were obtained by subtracting out the lead-time-dependent climatology from hindcasts, which effectively removes the climate drift assuming that the climate drift is systematic with the forecast lead time. Then, the internal residuals were computed by subtracting the mean forced response from hindcast anomalies. We diagnose the predictability in the internal residuals as the predictability gain arising purely from initialization. Mathematically, the internal residual (*I*) for a hindcast variable *X* is defined as

where *R* is the mean forced response, and the overbar denotes the climatology of the hindcasts, which is a function of lead time *τ*.

The observational datasets for evaluating prediction skill include the sea surface temperature from the U.K. Met Office Hadley Centre’s global Sea Ice and Sea Surface Temperature dataset, version 1.1 (HadISST 1.1) (available online at http://badc.nerc.ac.uk/data/hadisst/) (Rayner et al. 2003) and the Extended Reconstruction Sea Surface Temperature (ERSST) analysis version 3b (Smith and Reynolds 2004); the 2-m temperature (T2m) from the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP–NCAR) reanalysis (NNR) (Kalnay et al. 1996; Kistler et al. 2001) from 1948 to 2011; and the Twentieth-Century Reanalysis (20CR) from 1900 to 2010 (Compo et al. 2011). Consistent with the hindcasted internal residuals, we obtain observed internal residuals by subtracting out the long-term linear trend covering the whole dataset for each variable. Van Oldenborgh et al. (2012) reported that verification of the decadal climate prediction skill does not depend strongly on the definition of the trend, and we obtain similar results of decadal predictive skill when a quadratic trend is removed from observations, so we choose removing the long-term linear trend from observations in this study.

The observed AMO index used in Figs. 1 and 4 was downloaded from http://www.esrl.noaa.gov/psd/data/timeseries/AMO/. The AMO index is defined as the detrended, area-weighted-average SST over the North Atlantic from 0° to 70°N using the Kaplan SST dataset (Kaplan et al. 1998). The index was annually averaged and normalized to unit variance.

Complete details of the average predictability time (APT) can be found in DelSole and Tippett (2009a,b). Briefly, the method is to optimize APT, which is defined as the integral over lead time of the “signal to total” ratio of a forecast mode:

where is the variance of the ensemble mean at fixed lead time *τ* and is the corresponding total variance of the forecast ensembles. For the ensemble forecasts, the signal and total covariance can be approximated by the corresponding ensemble covariances. Following DelSole and Tippett (2009a), maximizing APT in ensemble forecasts leads to the generalized eigenvalue problem

where **q** is the projection vector, is the forecast ensemble mean covariance matrix at the lead time *τ* for a given variable, and is the total ensemble covariance matrix over all lead times. The eigenvectors **q** provide the basis for decomposing the multivariate time series into a complete, uncorrelated set of components ordered such that the first maximizes APT, the second maximizes APT subject to being uncorrelated with the first, and so on. The eigenvalues of (3) correspond to the APT values of each component. This decomposition based on APT is analogous to empirical orthogonal function (EOF) analysis except that we decompose predictability instead of decomposing variance. More detailed descriptions of the APT calculations are given in the first section of the appendix.

## 3. Results

### a. The IMP of SST

We first apply APT analysis to the internal residuals of the annual mean SST. As discussed in section 2, using the internal residuals ensures that the obtained predictability is due to internal variability and not to natural and anthropogenic forcing. The component with maximum APT in decadal hindcasts is shown in Fig. 1a. The pattern is predominantly of a general interhemisphere dipole with warm anomalies centered in the North Atlantic subpolar gyre region and North Pacific subpolar gyre region and cold anomalies centered in the Antarctic Circumpolar Current region. Note that the amplitudes of the pattern are considerably larger in the North Atlantic subpolar gyre region (over 0.5°C) than those in the North Pacific subpolar gyre and the Antarctic Circumpolar Current (about 0.1°–0.2°C) regions. The APT for this component is 18.9 yr, and the fraction of explained annual mean variance by this component is about 6.8%. The APT of this component is found to be statistically significant from a white noise process (see the first section of the appendix). Note that APT measures the time scale of predictability in a perfect model scenario, whereas the quantification of the actual predictive skill for this component remains to be verified against observations. Since the component has loadings concentrated in the North Atlantic subpolar gyre (SPG) and has temporal variations closely following the AMO index as shown next, we refer to this component as the AMO-like internal multidecadal pattern (IMP).

The APT value is probably underestimated because we only integrate APT up to 10-yr lead time in (3) due to the 10-yr upper limit of the hindcasts. If we had longer forecasts, we would expect to obtain components with at least as much APT as found in the 10-yr case. The APT also could depend significantly on model: DelSole et al. (2011) found that the APT estimated from preindustrial control runs of the CMIP3 dataset varied by more than a factor of 4 (see also Branstator et al. 2012). However, the model-to-model predictability variations in decadal hindcasts could be reduced because all model states are initialized from the same climate observing system, although the initialization methods vary among model hindcasts. The sensitivity of APT to different model hindcasts is beyond the scope of this study.

The time series of the component with maximum APT in the decadal hindcasts as a function of initialized years and lead time every five years from 1961 to 2010 are shown by shading in Figs. 1b,c. To assess the forecast skill of the component, we project the detrended ERSST and HadISST data onto the eigenvector **q** with maximum APT from (3) to obtain the observed time series, which are indicated by solid lines in Figs. 1b,c. A striking aspect of two observed time series is the multidecadal oscillations with negative anomalies during 1965–95 and positive anomalies during 1925–60 and 1996–2010. Also, the two observed time series are strongly correlated with the annual mean observed Atlantic multidecadal oscillation index (the correlation coefficients are 0.75 and 0.77 for ERSST and HadISST datasets, respectively). Interestingly, the initialized forecasts of the component closely follow the observed AMO index during the negative (1961–94) and positive (1996–2010) phases, and even the phase shift around 1995.

To facilitate comparison with statistical prediction methods, we chose a simple persistence model, a commonly used benchmark for skill, which assumes the forecast equals the initial condition. The anomaly correlation (AC) coefficients between forecasts and ERSST observation as a function of lead time, shown in Fig. 2a, are statistically significant up to 4-yr lead time at 95% significance level and up to 10 yr at 90% significance level for the hindcasts, while they are statistically significant up to 3-yr lead time at 95% significance level and up to 4 yr at 90% significance level for the persistence forecasts. Verification of predictions against HadISST data yields similar predictive skill (figure not shown). The *p* value of testing significance difference between two AC coefficients for model hindcasts and persistence forecasts, shown in Fig. 2b, tends to decrease from about 0.45 to 0.15 with forecast lead time. Note that the critical values and *p* values were computed by adjusting the effective sample size accounting for autocorrelations in the data (see the second section of the appendix). Although 41–50 forecasts were used to compute the skill of forecasts at given lead time, the effective sample size is generally less than 10 owing to the strong autocorrelations of the forecasts and observations, resulting in relatively large *p* values (larger than 0.1). Given these relatively large *p* values, we choose 0.25 as the threshold *p* value for testing the difference between two correlation coefficients. We conclude that the prediction skill of model hindcasts is statistically better than that of persistence forecasts for 8–10-yr lead times at 75% significance level, and not statistically different for 1–7-yr lead times.

The covarying structure of the AMO-like IMP in the North Atlantic and North Pacific is similar to the IMP identified from the CMIP3 multimodel unforced simulations using the APT analysis (DelSole et al. 2011), which may be explained as the impact of AMO on North Pacific climate variability through atmospheric teleconnections (Zhang and Delworth 2007). It was also indicated in previous studies that the strong signals of SST variability associated with AMO exist in the North Atlantic subpolar gyre in observations as well as model simulations (Folland et al. 1986; Kushnir 1994; Delworth and Mann 2000). Previous model studies indicated that multidecadal Atlantic meridional overturning circulation variations force interhemispheric dipolar SST anomalies (Collins et al. 2006; Latif et al. 2006b), and the out-of-phase relationship of SST variations between the Antarctic Circumpolar Current region and the North Atlantic subpolar gyre region associated with the Atlantic multidecadal oscillation was also revealed in observations (Latif et al. 2006a; Wu et al. 2011). The advance of this study is that the AMO-like IMP is not only successfully identified in decadal hindcasts using APT analysis but this pattern as a whole may be predictable up to 4 yr at 95% significance level and up to 10 yr at 90% significance level in the GFDL fully coupled ensemble initialization and decadal forecasting system.

To further confirm that the predictable signals are not statistical artifacts of the APT technique, we directly analyze predictability of anomalous SST averaged over the North Atlantic subpolar gyre region (marked as a box bounded by 45°–65°N, 50°–10°W in Fig. 1a), where the strong predictable signals were indicated by APT analysis. The time series of anomalous SST in the North Atlantic subpolar gyre region in the decadal hindcasts as a function of initialized years and lead time every five years from 1961 to 2010 are shown by shading in Figs. 3a,b. The initialized forecasts closely follow the observations during the negative (1961–94) and positive (1996–2010) phases, and even the phase shift around 1995. Also, the ensemble mean of uninitialized historical forcing simulations cannot produce multidecadal variations associated with realistic AMO, indicating that the decadal predictability arises from initialization. When forecasts are verified against ERSST, the SST in the North Atlantic subpolar gyre region is predictable up to 4 yr at 95% significance level and up to 7 yr at the 90% significance level in the model hindcasts, while the persistence model has skill only up to 3 yr at 95% significance level and up to 4 yr at 90% significance level (Fig. 3c). Verification of model hindcasts and persistence forecasts against HadISST data yields similar predictive skill (figure not shown). The predictive skill of model hindcasts is statistically better than that of persistence forecasts for 3–10-yr lead times at 75% significance level, and they are not statistically different for 1–2-yr lead times. This local predictability analysis confirms the robustness of the APT diagnosis for identifying the AMO-like IMP.

### b. The IMP of T2m

The skill in predicting the IMP of SST on decadal time scales would imply skill in predicting the air surface temperature as a response to SST forcing; thus we apply the same APT analysis as described above to the internal residuals of the annual mean 2-m temperature. The component with maximum APT in decadal hindcasts is shown in Fig. 4a. The pattern is predominantly of a general bipolar seesaw with warm anomalies extending from the North Atlantic subpolar gyre region to Greenland and the Arctic Ocean and cold anomalies centered in Antarctica. The APT value for this component is 17.8 yr, and the fraction of explained annual mean variance by this component is about 4.5%. The APT of this component is found to be statistically significant. We project the detrended NNR and 20CR data onto the leading eigenvector **q** for T2m from (3) to obtain the observed time series, shown in Figs. 4b,c. Two observed time series for the T2m pattern show similar multidecadal variations in phase with the AMO as those for the SST IMP, for example, negative anomalies during 1965–95 and positive anomalies during 1925–60 and 1996–2010, suggesting the multidecadal variability of T2m arises from multidecadal SST variations. The anomaly correlation coefficients between forecasts and observations as a function of lead time, shown in Fig. 5, indicate that the T2m pattern as a whole may be predictable up to 2 yr at 95% confidence level and up to 10 yr at 90% significance level verified against the NNR data (except the 6-yr lead time). The *p* values of testing significance difference between 6-yr lead time AC coefficient and 8–10-yr lead AC coefficients are larger than 0.3, so the nominal increase of AC coefficients after 6-yr lead time could be due to sampling errors. Verification of the same predictions against 20CR data yields similar predictive skill (figure not shown). The predictive skill of model hindcasts is statistically better than that of persistence forecasts for 7–10-yr lead times at 75% significance level, and not statistically different for 1–6-yr lead times (Fig. 5).

Although the so-called “bipolar seesaw” pattern has been observed in the Greenland and Antarctica ice core data on millennial time scales (Blunier and Brook 2001; Blunier et al. 1998), Chylek et al. (2010) identified a bipolar seesaw pattern in the twentieth-century Arctic and Antarctic instrumental temperature records on multidecadal time scales, and found that the Arctic (Antarctic) detrended temperatures are highly correlated (anticorrelated) with the AMO index, suggesting the Atlantic Ocean as a possible link between the climate variability of the Arctic and Antarctic regions. Our results not only confirmed the existence of the bipolar seesaw pattern on multidecadal time scales in phase with AMO, but progressed toward predicting the pattern as a whole on multiyear time scales in the GFDL fully coupled ensemble initialization and forecasting system.

### c. Roles of ocean internal dynamics in decadal prediction

Previous studies with coupled models assuming perfect ocean initial conditions indicate that accurate initialization of the Atlantic MOC may allow Atlantic multidecadal variability to be predicted a decade or more in advance (Collins et al. 2006; Msadek et al. 2010), and the AMO is suggested to be induced by Atlantic MOC variations and associated oceanic heat transport fluctuations (Knight et al. 2005), so we examine the North Atlantic oceanic heat transport in ECDA and decadal hindcasts. The regression coefficients of North Atlantic northward oceanic heat transport between 30° and 70°N onto the time series of normalized North Atlantic subpolar gyre region SST anomalies for ECDA, shown in Fig. 6a, tend to be positive, indicating that ECDA captures the anomalous northward (southward) heat transport into the North Atlantic subpolar gyre associated with the positive (negative) phase of AMO. These results indicate that the heat transport variations in the North Atlantic are in phase with the AMO in both ECDA and hindcasts, which is consistent with that the AMOC variations are in phase with the observed AMO using observational and modeling results (Zhang 2007, 2008).

Consequently, the decadal hindcasts initialized by ECDA also show similar anomalous northward heat transport into the North Atlantic subpolar gyre region. The time series of the anomalous heat transports averaged between 35° and 65°N in the decadal hindcasts as a function of initialized years and lead time for 11 hindcast cases from 1961 to 2010, shown in Figs. 6b,c, generally have the same sign as the anomalous heat transports in the ECDA, and show anomalous southward heat transports from the 1960s until the early 1980s and anomalous northward heat transports from the late 1990s until 2010. In contrast, the oceanic heat transport in the uninitialized historical forcing simulations shows a secular weakening trend from 1960 to 2020 without any multidecadal variations. The predictive skill of the heat transport is verified against ECDA since there is no observational long-term data available for oceanic heat transport. The model hindcasts have skill up to 2 yr at 95% confidence level, while the historical forcing simulations show negative correlation with ECDA (Fig. 7). These results suggest that the initialization of multidecadal variations of northward oceanic heat transport in the North Atlantic significantly improves the predictive skill of SST.

The forecast anomalies of the North Atlantic SST and North Atlantic oceanic heat transport averaged between 35° and 65°N for the fixed-forcing decadal prediction experiments and the decadal hindcasts are shown in Fig. 8. The fixed-forcing forecasts of the North Atlantic SST show similar multidecadal variations as the decadal hindcasts as well as the phase transition around 1995 (Fig. 8a), indicating that the internal signals due to initialization dominate over the forced signal on the decadal time scale in the model. Meanwhile, the fixed-forcing forecasts of the North Atlantic SST initialized in 1995 and 2005 show a cooling during about 3–10-yr lead time resulting from the return of radiative forcing to 1961 values. This is similar to the fast cooling of the global mean temperature as a response to an instantaneous return to preindustrial forcing (Held et al. 2010), but the initial decay rate of the North Atlantic SST is much slower than that of the global mean SST (Fig. 8c). Note that the global mean SST in the fixed-forcing experiments shows a similar fast cooling as the global mean temperature in the experiments by returning radiative forcing to preindustrial values (Held et al. 2010). The slower decay of North Atlantic SST than the global mean SST is due to the fact that the northward heat transports in the North Atlantic evolve similarly during the 10-yr forecasts in the two experiments (Fig. 8b), suggesting that the internal oceanic heat transport compensates the fast cooling of the North Atlantic SST. The fixed-forcing experiments further demonstrate that the ocean internal dynamics play dominant roles on decadal prediction over the external forcings.

The latest forecast shows that the AMO-like SST pattern is weakening in the coming decade 2011–20 but is still in the warm phase (Fig. 1). Consistently, the bipolar seesaw pattern of T2m is predicted to be weakening but still above the climatology in the coming decade (Fig. 4). The dynamical basis for these forecasts is that the North Atlantic northward heat transport is predicted to be weakening (Fig. 6). This predicted weakening of AMO-like patterns in the coming decade using a dynamical model is consistent with the statistically predicted weakening of the AMOC using subsurface and surface fingerprints (Mahajan et al. 2011).

## 4. Conclusions and discussion

The decadal predictability of SST and the 2-m air temperature (T2m) in the GFDL decadal hindcasts has been investigated using the average predictability time (APT) analysis. By diagnosing the internal residuals between initialized hindcasts and uninitialized historical forcing simulations using the same model (GFDL CM2.1), internal multidecadal patterns (IMP) for SST and T2m were identified. The IMP of SST is predominantly of an interhemisphere dipole with warm anomalies centered in the North Atlantic subpolar gyre region and North Pacific subpolar gyre region and cold anomalies centered in the Antarctic Circumpolar Current region. The IMP of T2m is predominantly of a general bipolar seesaw with warm anomalies extending from the North Atlantic subpolar gyre region to Greenland and the Arctic Ocean and cold anomalies centered in Antarctica.

The projected time series of IMP onto observations closely follow the multidecadal oscillations, with negative anomalies during 1965–95 and positive anomalies during 1925–60 and 1996–2010, in phase with the observed Atlantic multidecadal oscillation (AMO) index. The retrospective prediction skills of the IMP for SST and T2m were verified against independent observational datasets, revealing that the SST pattern is predictable up to 4-yr lead time at 95% significance level and up to 10 yr at 90% significance level, and the T2m pattern is predictable up to 2-yr lead time at 95% significance level and up to 10 yr at 90% significance level. Further analysis suggests that the initialization of multidecadal variations of northward oceanic heat transport in the North Atlantic significantly improves the predictive skill of the AMO-like IMP. The fixed-forcing decadal prediction experiments, in which radiative forcing is returned abruptly to 1961 values, further elucidate roles of ocean internal dynamics in the decadal prediction.

Beyond the robust long-term anthropogenic signal predicted by uninitialized CMIP3 climate models (Hegerl et al. 2007), the results presented here from the GFDL CMIP5 decadal hindcasts show that the coupled climate model, initialized by advanced data assimilation methods, may be capable of predicting AMO-like internal multidecadal pattern over multiyears to a decade, thus pointing toward the possibility of meaningful decadal climate outlooks using dynamical models if they are appropriately initialized by the climate observing system. However, the predictable signal of SST is primarily located in the North Atlantic subpolar gyre, and not a basin-scale signal observed in the North Atlantic. The predictable signals of T2m are mainly over the two polar regions, but not over other continents indicated by previous studies (Hermanson and Sutton 2010). This discrepancy may be attributable to model biases and lack of deep-ocean observations in the twentieth century. Another possibility is that the continental predictability may be related to some other SST pattern that is less predictable than the AMO-like pattern (Jia and DelSole 2011).

A challenge for decadal prediction is the lack of many decadal-scale events (e.g., AMO transitions) for assessing its reliability. In contrast, there are many observed ENSO cases for assessing seasonal climate prediction. In this study, the hindcast period from 1961 to 2011 covers only one AMO episode, so the reliability of the predictive skill presented here needs to be evaluated by hindcasts covering more AMO episodes. More research is needed on producing a larger set of initial conditions covering more AMO episodes for assessing decadal climate prediction. Nevertheless, the skill of the dynamical model prediction is consistently better than that of the persistence forecasts, especially for forecast lead times longer than 5 years, thus the results presented here are encouraging for CMIP5 decadal prediction initiatives using initialized dynamical models.

## Acknowledgments

We thank Isaac Held, Takeshi Doi for helpful reviews of an earlier draft, and three anonymous reviewers for constructive comments that helped to improve the manuscript. This research was supported by the Visiting Scientist Program at the National Oceanic and Atmospheric Administration’s Geophysical Fluid Dynamics Laboratory, administered by the University Corporation for Atmospheric Research.

### APPENDIX

#### Details of Calculation

##### a. APT

For solving the APT optimization problem [Eq. (2) in the main paper] in practice, the data are first projected onto the leading principal components (PC) (DelSole et al. 2011). The sensitivity of APT to the truncation of PCs for SST, shown in Fig. A1, indicates that the results for the time series of IMP and projection of IMP onto observations are virtually independent of the number of PCs in the range of 20–40 PCs, presumably because the time series of 5100 years are relatively long. We obtained similar results for the APT analysis of T2m (figures not shown). We choose 30 EOFs for displaying results for both SST and T2m in the paper.

Following (DelSole et al. 2011), the statistical significance test of APT was estimated by Monte Carlo methods. The null hypothesis for the test is that the data are drawn from a white noise process when sampled every 2 yr. Accordingly, we generate a 30 × 2500 data matrix by drawing independent random numbers from a normal distribution with zero mean and unit variance. The time dimension of the data was grouped as a set of 25 10-yr forecasts with 10 ensemble members. Thirty APT values were then determined. This procedure was repeated 1000 times to generate 1000 × 30 APT values. The upper fifth percentile of the 1000 × 30 APT values was then determined as the threshold values, plotted in Fig. A2 as the horizontal line. The figure shows that the first 25 (10) components of SST (T2m) have statistically significant APT values. However, only the leading component has multiyear predictive skill verified against observations, so we only focus on the leading component in this study.

##### b. Statistical tests

Two statistical significance tests were performed in this study. One null hypothesis is that the anomaly correlation coefficient is zero, and critical values at 95% (90%) significance levels are computed for assessing the predictive skill at each lead time. The other null hypothesis is that two anomaly correlation coefficients are not different from each other. We compute the *p* value of testing the difference of two anomaly correlation coefficients for the decadal hindcasts and the persistence forecasts at each lead time.

Autocorrelation in the data was accounted for by adjusting the effective sample size (*N**) using the following procedure (Trenberth 1984; Bretherton et al. 1999):

where *N* is the number of sample pairs, and *ρ _{xx}*(

*j*) and

*ρ*(

_{yy}*j*) are the sample autocorrelation of

*x*and

*y*at lag

*j*. The effective

*N** was used for computing the critical values of testing the significance difference of the anomaly correlation from zero, and

*p*values of testing the significance difference of two correlation coefficients. Although 41–50 samples were used to compute the skill of forecasts at each lead time, we found that the effective sample size is generally less than 10 owing to the strong autocorrelations of the forecasts and observations.

## REFERENCES

*Climate Change 2007: The Physical Science Basis,*S. Solomon et al., Eds., Cambridge University Press, 663–745.