## Abstract

The suitability of a linear inverse model (LIM) as a benchmark for decadal surface temperature forecast skill is demonstrated. Constructed from the observed simultaneous and 1-yr lag covariability statistics of annually averaged sea surface temperature (SST) and surface (2 m) land temperature global anomalies during 1901–2009, the LIM has hindcast skill for leads of 2–5 yr and 6–9 yr comparable to and sometimes even better than skill of the phase 5 of the Coupled Model Intercomparison Project (CMIP5) model hindcasts initialized annually over the period 1960–2000 and has skill far better than damped persistence (e.g., a local univariate AR1 process). Over the entire post-1901 record, the LIM skill pattern is similar but has reduced amplitude. Pronounced similarity in geographical variations of skill between LIM and CMIP5 hindcasts suggests similarity in their sources of skill as well, supporting additional evaluation of LIM predictability. For forecast leads above 1–2 yr, LIM skill almost entirely results from three nonorthogonal patterns: one corresponding to the secular trend and two more, each with about 10-yr decorrelation time scales but no trend, that represent most of the predictable portions of the Atlantic multidecadal oscillation (AMO) and Pacific decadal oscillation (PDO) indices, respectively. As found in previous studies, the AMO-related pattern also contributes to multidecadal variations in global mean temperature, and the PDO-related pattern has maximum amplitude in the west Pacific and represents the residual after both interannual and decadal ENSO variability are removed from the PDO time series. These results suggest that current coupled model decadal forecasts may not yet have much skill beyond that captured by multivariate, predictably linear dynamics.

## 1. Introduction

The decadal prediction problem has been in an embryonic stage for decades. To progress, we could simply apply the climate community's long experience in understanding seasonal-to-interannual variability and improving its prediction to decadal time scales. For example, a wide array of both physical and empirical methods has been used to make ENSO forecasts (e.g., review by Latif et al. 1998). Statistical forecasts are relatively easy and economical to perform, and their skill can be sufficiently high that they are useful both in their own right and as benchmarks for more complex numerical models (e.g., Livezey 1999; van Oldenborgh et al. 2005; Laepple et al. 2008; Krueger and von Storch 2011).

It seems reasonable then that a similar two-pronged approach of physical and empirical methods could advance decadal prediction. This is not to say that improvement will or can occur as readily as for seasonal forecasting. One concern is that, while ENSO provides a well-defined interannual phenomenon understandable as the result of a defined mechanism (e.g., delayed oscillator theory, recharge–discharge mechanism), there do not yet appear to be so clearly defined decadal phenomena, at least in the Pacific. Large-scale patterns such as the Pacific decadal oscillation (PDO; Mantua et al. 1997) do not dominate decadal variability to the same degree as ENSO dominates interannual variability and moreover may represent the superposition and/or convolution of a few mechanisms (e.g., Schneider and Cornuelle 2005; Newman 2007, hereafter N07), including the low-frequency or reddened tail of interannual phenomena (e.g., Newman et al. 2003a; Vimont 2005), rather than the result of one identifiable physical process. The effects of anthropogenic climate change complicate comparison between models and observations, and how to distinguish natural from anthropogenically forced decadal variability is a fundamental problem (Solomon et al. 2011).

Currently, a number of modeling centers have carried out decadal “hindcasts” as part of phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012). It is an important long-range goal of climate diagnosis to provide insights that will help improve decadal forecasts from these coupled GCMs (CGCMs). Here, we diagnose annual-to-decadal variability and predictability, both unforced and forced, with an empirically determined linear model of the observed system.

## 2. Multivariate red noise

Climate variability is often characterized by a notable separation between the dominant time scales of interacting processes. For example, compared to much longer ocean time scales, weather varies so rapidly that it has almost no memory. Weather forcing of the ocean can then be approximated as white noise forcing of a damped integrator or univariate red noise for an anomaly scalar time series, the simplest null hypothesis of climate (Hasselmann 1976). When extended to the more general case of anomalies representing many evolving regional patterns of climate variables, this approximation based on time-scale separation becomes “multivariate red noise” (more generally known as a multivariate Ornstein–Uhlenbeck process, or a continuous version of a multivariate AR1 process). As opposed to its univariate counterpart, multivariate red noise contains both stationary and propagating anomaly patterns (so that scalar indices derived from it can have spectral peaks) and allows for nonsymmetric dynamical relationships (so that, despite the lack of exponential modal instability, some anomalies experience significant but transient growth and evolution over finite time intervals). A nonlinear system usefully approximated by multivariate red noise can be said to be “predictably” linear.

Linear inverse models (LIMs; Penland and Sardeshmukh 1995), which empirically determine multivariate red noise from observations, provide excellent approximations of observed Pacific SST anomaly evolution on time scales ranging from weeks to years. N07 found that a LIM constructed from annually averaged tropical and North Pacific SSTs reproduced observed tropical–North Pacific relationships on decadal time scales better than most CMIP3 coupled GCMs. Subsequent studies have had similar success in the Atlantic (Hawkins and Sutton 2009; Zanna 2012) and both ocean basins (Vimont 2012).

In this paper, the N07 analysis is extended to a state vector constructed from global SSTs and surface land temperatures. The resulting LIM is shown to have skill comparable to three CMIP5 decadal hindcast models that used yearly start dates for the period 1960–2000. The sources of this skill are diagnosed and evaluated in the context of simpler climate indices.

## 3. Data and model details

Datasets used in this study were SSTs from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST; Rayner et al. 2003) and surface (2 m) land temperatures from the University of East Anglia Climatic Research Unit (CRU) TS 3.1 dataset (Mitchell and Jones 2005) during the period 1901–2009. Monthly data were interpolated onto 2° latitude × 5° longitude grid boxes. Anomalies were determined by removing the climatological monthly mean from data that were temporally smoothed with a 12-month running mean. This aids our analysis, which does not consider seasonality. However, seasonality is likely still relevant to decadal variability (e.g., Newman et al. 2003a; Vimont 2005). Data were prefiltered in an EOF space retaining about 81% of the SST and 62% of the surface land temperature variances. SST EOFs were determined within ice-free regions only; the remaining ocean regions were regressed onto the SST principal components (PCs) to provide complete spatial coverage.

The multivariate AR1 process for an anomaly state vector **x**,

is the integrated solution of the dynamical system

forced by white noise **ξ**, where _{1 }**=** exp(), **σ** is a random error vector, and *t* is measured in years. N07 determined (2) from Pacific SSTs, and we have likewise determined it from global SSTs (not shown). When including surface land temperatures, however, a few eigenmodes of decorrelate too rapidly to be sampled at 1-yr intervals, so here we take the simpler route of instead determining (1) from observations. Note, however, that from both (1) and (2) the best forecast from initial conditions **x**(0) for a lead of *n* yr is

and the lag covariance statistics of **x** over a *n*-yr interval is

where and . This allows us to still make forecasts using the LIM and to test its assumption of predictably linear dynamics, and the eigenmodes of are still relevant since they are also the eigenmodes of _{1}.

The leading 11 (6) EOFs of anomalous SSTs (surface land temperatures) were retained for the model, with the corresponding PCs defining a 17-component state vector **x**(*t*).

Finally, the LIM must be tested on independent data, so estimates of _{1} and of hindcast skill were cross validated: the data record was subsampled by withholding 10% of it; then anomalies, EOFs, and _{1} were recalculated from the remaining 90%; and finally hindcasts were generated for the independent period. The procedure was repeated for the entire record. LIM hindcast skill was determined from these jack-knifed hindcasts, with hindcasts verified against the complete (untruncated in EOF space) gridded observations. EOF truncation was chosen to maximize globally averaged squared anomaly correlation skill of 1–2-yr cross-validated hindcasts, but results are fairly insensitive to this parameter choice. Additionally, linearly detrended data from the entire record was used with the same EOF truncation (albeit recalculated EOFs) to construct a second “detrended LIM.”

Hindcasts from the LIM are compared to hindcasts from three CMIP5 CGCMs: third climate configuration of the Met Office Unified Model [HadCM3 (DePreSys)]; Max Planck Institute Earth System Model, low resolution (MPI-ESM-LR); and Geophysical Fluid Dynamics Laboratory Climate Model, version 2.1 (GFDL CM2.1). These models were chosen since they were the only available models whose hindcasts were initialized yearly rather than every 5 yr. Skill was determined from the bias-corrected ensemble mean for each hindcast initialization.

## 4. Results

### a. Testing the empirical model

We first test the assumption of linear dynamics underlying the LIM. Figure 1 shows the observed lag-autocovariance for *n* = 2, 4, 6, and 8 yr compared to that predicted by (4). Generally, the match is quite good and confirms that the LIM reproduces the statistics of evolving surface temperature anomalies over the twentieth century, with deficiencies over Northern Europe and some parts of North America. Also, the LIM captures the 2-yr lag anticorrelation in the tropical eastern Pacific but underestimates its amplitude, likely because some subsurface ocean anomalies are needed in addition to SST to completely represent ocean state evolution within the LIM on time scales greater than 1 yr (Newman et al. 2011). Still, the LIM does implicitly include those subsurface effects that are linearly related to SST (Penland and Sardeshmukh 1995), in contrast to a physical dynamical model in which the evolution of the state vector is governed only by explicitly represented interactions among its components.

### b. Skill of LIM and CMIP5 decadal hindcasts

Figure 2 shows skill measured by local anomaly correlation for hindcasts averaged over leads of 2–5 yr (left panels) and 6–9 yr (right panels). The top two rows show that the LIM sets a much higher benchmark for skill than does damped persistence (i.e., a grid-space univariate AR1 model). Additionally, for yearly start dates from 1960 to 2000, skill of LIM decadal hindcasts is comparable to and sometimes better than skill from CMIP5 decadal hindcasts. Maxima (e.g., tropical Indian and Atlantic Oceans, central North Atlantic, southwestern United States, and east central Asia) and minima (e.g., eastern equatorial and northeast Pacific, western South America, and off U.S. Atlantic coast) of skill often coincide between the LIM and CGCM hindcasts, both those shown here and those documented in other studies (e.g., van Oldenborgh et al. 2012; Kim et al. 2012). LIM can thus serve as a benchmark for decadal forecast skill.

Hindcast skill for the Atlantic multidecadal oscillation (AMO) and PDO indices is shown in Fig. 3; for comparison, root-mean-square error (rmse) is also displayed. Here, the AMO index is the area-weighted North Atlantic mean SST, between 0° and 60°N, minus the global mean SST (Trenberth and Shea 2006; van Oldenborgh et al. 2012), and the PDO index is the projection of SST on the leading EOF of monthly detrended North Pacific SST anomalies between 20° and 60°N (Mantua et al. 1997). For the LIM, as previously found for CMIP5 hindcasts (Guemas et al. 2012; Kim et al. 2012; van Oldenborgh et al. 2012), AMO skill is generally higher than PDO skill, which drops off very rapidly for leads greater than 1 yr. The LIM again clearly provides a more stringent decadal forecast test than persistence (here determined from the indices time series and not the gridded values), except for short lead PDO forecasts, with skill that is generally higher (although not significantly so) than the CGCMs. Even global mean temperature hindcast skill is generally comparable.

### c. Diagnosing forecast skill

Since the geographical variations of skill are often similar between the LIM and CGCMs hindcasts, it seems possible that the sources of skill for CGCMs are largely the same as for the LIM, despite very great differences in model construction. The LIM's low order and simplicity makes it straightforward to assess surface temperature decadal predictability. For an infinite ensemble of a perfect model, forecast skill measured by the average anomaly correlation between forecast and verification anomalies is also a function of the forecast signal-to-noise ratio *S* at lead time *n*,

(Sardeshmukh et al. 2000). In the LIM, ensemble spread (due to noise) is assumed to converge to climatology with increasing *n* but to be state independent; so on average *S* is directly related to stronger predictable signal determined from (3) (Newman et al. 2003b). For leads under about 3 yr, transient anomaly growth due to interference of the nonorthogonal eigenmodes is important, but at longer leads forecast skill is higher for those initial states that project more strongly on the least damped eigenmodes of .

The three leading eigenmodes of are shown in Fig. 4 along with their associated projection coefficient time series. The leading eigenmode, stationary with a very long *e*-folding time (EFT), appears to represent the global secular trend pattern. The second/third eigenmode pair nominally propagates but effectively consists of two distinct quasi-stationary patterns since the period is very much greater than the 11-yr EFT. This eigenmode pair represents decadal variability, primarily over the Atlantic (most energetic phase) and over the Pacific (least energetic phase); it has no overall trend and is also recovered as the leading eigenmode pair for the detrended LIM, albeit with an EFT of about 9 yr. The time series in Fig. 4d is reminiscent of the AMO. Additionally, since Fig. 4c has a large global mean component, its variation gives rise to multidecadal fluctuations of the global mean temperature that reduce the warming between about 1945 and 1990 and enhance it before and after this period (not shown), consistent with some recent studies (e.g., Ting et al. 2009; DelSole et al. 2011; Wu et al. 2011). In the Pacific, Fig. 4e is very similar to the second leading eigenmode of the N07 LIM, which was the residual of the PDO once ENSO influences were removed. Note that this pattern has relatively larger amplitude in the western North Pacific, a region whose potential importance to PDO-like variability has also been highlighted by earlier studies (e.g., see review by Kwon et al. 2010). The remaining eigenmodes all have EFTs shorter than 3 yr, including the “interannual ENSO” (Fig. 4g) and “decadal ENSO” (not shown but similar to N07) eigenmodes that have EFTs of 1.5 and 2 yr, respectively. The relatively short decorrelation time scale but high variability of this eigenmode (Fig. 4h) limits predictability both along the equatorial east Pacific and in the northeast Pacific (as in N07); in essence, on decadal time scales ENSO is noise (Solomon and Newman 2011).

The impact of these eigenmodes upon hindcast skill over the entire 1901–2009 record was explored by creating several different sets of hindcast initializations, each with nonzero projections on different subsets of the eigenmodes. Figure 5 shows that almost all of the total hindcast skill (top panels of Fig. 5) can be recovered for hindcasts initialized with data projected on the leading three eigenmodes alone (middle panels of Fig. 5). Figure 6 shows the impact of the least damped eigenmodes on climate index skill. As in Fig. 5, almost all skill is retained for hindcasts initialized with the leading three eigenmodes only, except where ENSO eigenmodes also impact PDO skill (Fig. 6a) primarily at shorter leads. Longer-range PDO skill is related to the least energetic phase, while AMO skill (Fig. 6b) is related to the most energetic phase. Global mean temperature skill (Fig. 6c) is mostly due to the leading eigenmode but also has a contribution from the most energetic phase since it contributes to both the AMO and global mean. Note also that hindcast skill of the detrended LIM and full LIM is almost identical for both the AMO and PDO indices; in fact, hindcast skill of both indices is about the same whether or not the leading eigenmode is included in initializations of the full LIM. Apart from these indices, however, much of the LIM skill appears due to the trend (cf. top and bottom panels of Fig. 5) or alternatively the leading eigenmode, as has also been suggested for the CGCMs (van Oldenborgh et al. 2012).

## 5. Concluding remarks

A LIM, empirically constructed from annually averaged surface temperature observations using a 1-yr lag, has been shown to be a more suitable benchmark for decadal forecasts than damped persistence. In fact, it appears that CMIP5 CGCM decadal hindcast skill does not notably exceed that expected from a predictably linear dynamical system. To the extent that LIM and CGCM skill are comparable, both in amplitude and in geographical variation, the much simpler LIM can also be used to diagnose sources of forecast skill for both forecast systems.

Estimates of decadal predictability from even 100 yr of data are necessarily limited (e.g., Wunsch 2013), so it is important to view the results of this paper with some caution. In the LIM we obtained, virtually all long-range skill comes from the leading three eigenmodes with longest *e*-folding times. The leading eigenmode represents the global secular trend pattern while the next eigenmode pair represents decadal variability. Note that this eigenmode pair does not propagate with a multidecadal period but instead has a sufficiently long *e*-folding time that it varies on a multidecadal time scale. The most notable deficiency in CGCM hindcast skill compared to the LIM appears to be related to this eigenmode over the Pacific. It is interesting that the similar eigenmode found in the Pacific-only LIM was poorly simulated in all the CMIP3 preindustrial control and historical model simulations (N07; Solomon et al. 2011). Whether the global version of this eigenmode continues to be poorly represented by the CMIP5 models and, if so, why are subjects for further investigation.

## Acknowledgments

The author thanks Mike Alexander, Prashant Sardeshmukh, Amy Solomon, and three anonymous reviewers for helpful comments. This work was supported by grants from NOAA CVP and NSF 1035423.

## REFERENCES

*J. Climate,*

**24,**1276–1283.

*J. Climate,*

**23,**3249–3281.

*Analysis of Climate Variability: Applications of Statistical Techniques,*H. von Storch and A. Navarra, Eds., Springer Verlag, 177–196.

*Bull. Amer. Meteor. Soc.,*

**92,**485–498.

*Deep-Sea Res. II,*