## Abstract

Retrospective predictions of multiyear North Atlantic Ocean hurricane frequency are explored by applying a hybrid statistical–dynamical forecast system to initialized and noninitialized multiyear forecasts of tropical Atlantic and tropical-mean sea surface temperatures (SSTs) from two global climate model forecast systems. By accounting for impacts of initialization and radiative forcing, retrospective predictions of 5- and 9-yr mean tropical Atlantic hurricane frequency show significant correlations relative to a null hypothesis of zero correlation. The retrospective correlations are increased in a two-model average forecast and by using a lagged-ensemble approach, with the two-model ensemble decadal forecasts of hurricane frequency over 1961–2011 yielding correlation coefficients that approach 0.9. These encouraging retrospective multiyear hurricane predictions, however, should be interpreted with care: although initialized forecasts have higher nominal skill than uninitialized ones, the relatively short record and large autocorrelation of the time series limit confidence in distinguishing between the skill caused by external forcing and that added by initialization. The nominal increase in correlation in the initialized forecasts relative to the uninitialized experiments is caused by improved representation of the multiyear tropical Atlantic SST anomalies. The skill in the initialized forecasts comes in large part from their persistence of the mid-1990s shift, rather than from predicting its evolution. Predicting shifts like that observed in 1994/95 remains a critical issue for the success of multiyear forecasts of Atlantic hurricane frequency. The retrospective forecasts highlight the possibility that changes in the observing system can impact forecast performance.

## 1. Introduction

Predicting and projecting future North Atlantic Ocean hurricane activity is a topic of scientific interest (e.g., Gray 1984; Knutson and Tuleya 2004; Emanuel 2005; Camargo et al. 2007a; Vecchi et al. 2008; Smith et al. 2010, hereafter S10; Knutson et al. 2010; Vecchi et al. 2011; Villarini et al. 2011b; Villarini and Vecchi 2012b, 2013a,b) and high societal significance (Pielke et al. 2008; Mendelsohn et al. 2012; Peduzzi et al. 2012). The seasonal basinwide frequency of North Atlantic hurricanes has exhibited variability on a variety of time scales, from interannual to multidecadal, although it remains unclear whether there has been any century-scale trend in Atlantic hurricane frequency (e.g., Mann and Emanuel 2006; Vecchi and Knutson 2008, 2011; Landsea et al. 2010; Villarini et al. 2011a).

The scientific basis for predictions of seasonal hurricane activity at leads of one to three seasons has been developed (e.g., Gray 1984; Elsner and Jagger 2006; Vitart 2006; Camargo et al. 2007a,b; Vitart et al. 2007; Klotzbach and Gray 2009; Wang et al. 2009; Kim and Webster 2010; LaRow et al. 2010; Zhao et al. 2010; Alessandri et al. 2011; Chen and Lin 2011; Vecchi et al. 2011; Villarini and Vecchi 2013b), leading to the identification of different potential sources of skill, both local and remote.

Decadal to centennial projections of seasonal hurricane activity in response to changes in external forcing (greenhouse gases, aerosols, volcanoes, and solar) have been made (e.g., Oouchi et al. 2006; Knutson et al. 2008; Emanuel et al. 2008; Gualdi et al. 2008; Vecchi et al. 2008; Sugi et al. 2009, 2012; Zhao et al. 2009; Bender et al. 2010; Knutson et al. 2010; Villarini et al. 2011b; Zhao and Held 2011; Villarini and Vecchi 2012b, 2013a). The basis for these projections is the possibility that radiatively forced climate change could influence the climatic conditions to which hurricanes are sensitive, such as large-scale circulation, wind shear, ocean temperatures, potential intensity, and humidity (e.g., Emanuel 1987, 2007; Broccoli and Manabe 1990; Shen et al. 2000; Knutson and Tuleya 2004; Camargo et al. 2007b; Vecchi and Soden 2007a,b). Recent model results span a relatively wide range of possibilities for North Atlantic hurricane frequency (including increases or decreases) under enhanced CO_{2}-induced warming, while these studies show a more consistent tendency for hurricane intensity to increase (e.g., Knutson and Tuleya 2004; Knutson et al. 2008; Emanuel et al. 2008; Gualdi et al. 2008; Vecchi et al. 2008; Sugi et al. 2009, 2012; Zhao et al. 2009; Bender et al. 2010; Knutson et al. 2010; Villarini et al. 2011b; Villarini and Vecchi 2012b, 2013a). There are indications that changes in atmospheric aerosols could influence past and projected hurricane activity, with increases (decreases) in Atlantic aerosol loading driving decreases (increases) in Atlantic hurricane activity (Mann and Emanuel 2006; Evan et al. 2009; Villarini and Vecchi 2012b, 2013a).

Assessing hurricane predictability at intermediate time scales, between seasonal predictions and multidecadal projections, is an emerging field of research. In addition to potential influences caused by changes in radiative forcing, internal variations of the climate system could play a large role in changes of hurricane frequency on time scales of decades (e.g., Goldenberg et al. 2001; Zhang and Delworth 2006, 2009; Knight et al. 2005; Latif et al. 2007; Dunstone et al. 2011; Villarini et al. 2011b; Villarini and Vecchi 2012b). There are physical reasons to expect coherent multiyear hurricane variations to be tied to ocean changes (e.g., Goldenberg et al. 2001; Zhang and Delworth 2005, 2006, 2009; Knight et al. 2005; Latif et al. 2007; Dunstone et al. 2011). There are also indications that some of the relevant ocean changes may be potentially predictable on decadal time scales (e.g., Griffies and Bryan 1997a,b; Pohlmann et al. 2004; Collins et al. 2006; Pohlmann et al. 2009; Msadek et al. 2010; S10; Teng et al. 2011; Chikamoto et al. 2013; van Oldenborgh et al. 2012; A. Rosati et al. 2012, unpublished manuscript; Yang et al. 2012; Yeager et al. 2012). As decadal variability and the associated predictability can result from both internally and externally forced fluctuations (e.g., Rotstayn and Lohmann 2002; Hawkins and Sutton 2009; C. Chang et al. 2011; Villarini et al. 2011b; Booth et al. 2012; Zhang et al. 2013; Villarini and Vecchi 2012b), one has to consider skill arising from both external factors and internal variability on multiyear time scales. A number of modeling groups are now following the same framework for phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012) to be assessed as part of the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5), by performing decadal predictions initialized with estimates of the observed state of the climate system (Taylor et al. 2012; Meehl et al. 2013). 
While for sea surface temperatures (SSTs) most of the skill on multiyear time scales arises from predicting the warming trend associated with radiative forcing changes (e.g., van Oldenborgh et al. 2012; A. Rosati et al. 2012, unpublished manuscript), there is at least one study suggesting that initialization can increase the skill in multiyear hurricane forecasts (S10). In this paper, we explore the ability of a hybrid statistical–dynamical hurricane forecasting system to retrospectively predict multiyear hurricane activity in the Atlantic using two different coupled climate models, including the one used by S10. We explore the skill of North Atlantic hurricane frequency resulting from changing radiative forcing and from natural variability. We assess the improvement in skill caused by initialization and discuss the source of this improved skill and its implications for future multiyear forecasts of North Atlantic hurricane frequency.

## 2. Data and methods

### a. Statistical hurricane emulator

We use a hybrid statistical–dynamical North Atlantic hurricane frequency prediction framework to explore the predictability of multiyear hurricane activity. This framework has been shown to exhibit retrospective skill in seasonal hurricane forecasts from as early as boreal winter prior to the hurricane season (Vecchi et al. 2011). It combines a statistical emulator of a high-resolution dynamical atmospheric model (Zhao et al. 2009, 2010) with initialized forecasts of SST. The statistical emulator is formulated as a Poisson regression model with two predictors: tropical Atlantic SST and tropical-mean SST, each averaged over the August–October season.

The choice of these two predictors is motivated by dynamical considerations, observed relationships between hurricane activity and SST, and the sensitivity of dynamical models to SST perturbations. Observational analyses have highlighted correlations between SST changes in the tropical Atlantic and hurricane activity indices (e.g., Elsner and Jagger 2006; Emanuel 2005). However, observational correlations as high or higher have been found between hurricane activity and the weighted difference between Atlantic and tropical-mean SSTs (the SST changes in the Atlantic relative to the tropics, or "relative SST") by other studies (e.g., Swanson 2007, 2008; Vecchi et al. 2008; Villarini et al. 2010, 2011b, 2012; Villarini and Vecchi 2012a). The physical basis for exploring relative SST as a predictor of hurricane activity rests on the tendency of free-tropospheric temperature changes to follow those of tropical-mean SST (Sobel et al. 2002), or SSTs in the Indo-Pacific region where the bulk of tropical convection resides (Tang and Neelin 2004), as described by the weak temperature gradient approximation (Sobel and Bretherton 2000). An Atlantic SST warming that is larger than that of the tropical average, with a tropospheric warming in the Atlantic that follows tropical-mean SST, would lead to a large-scale destabilization of the atmosphere in the Atlantic, to changes in the large-scale vorticity, shear, and atmospheric humidity, and to increases in tropical cyclone (TC) potential intensity (e.g., Latif et al. 2007; Vecchi and Soden 2007a; Gualdi et al. 2008; Sugi et al. 2009, 2012; Zhao et al. 2009; Xie et al. 2010; Zhao and Held 2011; Ramsay and Sobel 2011; Camargo et al. 2013; Vecchi et al. 2013).
Supporting the notion of relative SST as a predictor for Atlantic hurricane activity, dynamical modeling studies have found that the threshold for TC genesis under projected climate changes over the twenty-first century increases along with the overall tropical warming (e.g., Knutson et al. 2008). The interannual, decadal, and climate change response of North Atlantic TC frequency simulated across a range of dynamical frameworks is also well explained by relative SST (e.g., Vecchi et al. 2008; Sugi et al. 2009, 2012; Zhao et al. 2009, 2010; Vecchi et al. 2011; Villarini et al. 2011b; Knutson et al. 2013; Zhao and Held 2011), although strong departures caused by moist adiabatic warming can complicate relative SST models of hurricane frequency (e.g., Vecchi et al. 2013).

Following Vecchi et al. (2011), we model the rate of occurrence *λ* (the expected value of the aggregate seasonal number) of North Atlantic hurricane frequency using a Poisson regression model as follows:

*λ* = exp(*β*_{0} + *β*_{1}SST_{MDR} + *β*_{2}SST_{TROP}),
where SST_{MDR} and SST_{TROP} are anomalies in the regional SST indices relative to the 1982–2005 average, as described in section 2c. Note that SST_{MDR} is the average over the hurricane main development region (10°–25°N, 80°–20°W), and SST_{TROP} is the global, 30°S–30°N average of SST. As discussed in Vecchi et al. (2011), this statistical emulator of the sensitivity of hurricane frequency to SST changes in the Zhao et al. (2009, 2010) high-resolution atmospheric model was trained across a broad range of climate states, including multiple realizations of the historical period and various projections of twenty-first-century SST change. This statistical model's performance against the observed record satisfies a necessary condition for its application to interannual to decadal predictions (Vecchi et al. 2011). The parameters in this statistical emulator, built on the output of a high-resolution AGCM, are very similar to those that arise from modeling-adjusted hurricane frequency over the 1878–2008 period (Villarini et al. 2012). This statistical emulator is able to reproduce much of the observed variability in hurricane activity (*r*^{2} = 0.58; Vecchi et al. 2011), and its ability to recover changes in hurricane frequency compares well with hindcasts and projections from high-resolution dynamical models (e.g., Zhao et al. 2009, 2010; Villarini et al. 2011b; Knutson et al. 2013). The low computational cost of the statistical emulator allows us to efficiently perform a variety of retrospective forecasts using multiple input datasets, described below.
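The structure of this emulator can be sketched in a few lines of code. The coefficient values below are illustrative placeholders, not the fitted values reported in Vecchi et al. (2011); they are chosen only so that their signs reflect the relative SST sensitivity discussed above (positive for SST_{MDR}, negative for SST_{TROP}).

```python
import numpy as np

def hurricane_rate(sst_mdr, sst_trop, b0=1.85, b_mdr=1.4, b_trop=-1.1):
    """Poisson-regression emulator of seasonal North Atlantic hurricane
    frequency: the expected count is log-linear in the Aug-Oct SST
    anomalies (deg C, relative to 1982-2005) of the main development
    region and of the 30S-30N tropical mean.

    The coefficients here are illustrative placeholders, not the fitted
    values of Vecchi et al. (2011).
    """
    return np.exp(b0 + b_mdr * sst_mdr + b_trop * sst_trop)

# A uniform tropical warming changes the expected count far less than
# the same Atlantic warming occurring relative to the tropical mean:
uniform = hurricane_rate(0.5, 0.5)    # MDR and tropics warm together
relative = hurricane_rate(0.5, 0.0)   # MDR warms relative to tropics
```

Because the two predictors enter with opposite signs, the emulator responds primarily to relative SST, consistent with the behavior of the dynamical model on which it was trained.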

### b. Global climate model predictions

The statistical emulator (described above) is applied to predictions of SST from two global climate models: the National Oceanic and Atmospheric Administration (NOAA) Geophysical Fluid Dynamics Laboratory (GFDL) Climate Model, version 2.1 (CM2.1), and the Met Office (UKMO) Decadal Prediction System (DePreSys) Perturbed Physics Ensemble (PPE), referred to as GFDL-DecPre and UKMO-DePreSys, respectively. The forecast system specifications are summarized in Table 1. These models are just two of those that will be part of the CMIP5 decadal prediction experiments, although the CMIP5 version of UKMO-DePreSys is slightly different from the one used here. Exploration of those models allows us to compare the behavior of a prediction system that has shown skill in interannual hurricane predictions using the hybrid statistical–dynamical framework (i.e., GFDL-DecPre; Vecchi et al. 2011) and also to apply the hybrid framework to a model system that has shown high multiyear correlations using an alternative approach (i.e., UKMO-DePreSys; S10). Additionally, these two models generated a full ensemble of initialized predictions each year, rather than every five years as in many other CMIP5 experiments (Meehl et al. 2013), allowing us to more fully explore past performance.

The GFDL decadal climate hindcasts (i.e., GFDL-DecPre) are carried out over the period 1961–2011 using the GFDL CM2.1 coupled system (Delworth et al. 2006), in which both the atmosphere and the ocean are initialized through a full-field assimilation to bring the state of the coupled model close to observations. The initial conditions are produced with the GFDL fully coupled reanalysis version 3.1 of the GFDL Ensemble Coupled Data Assimilation System (ECDA3.1), which is based on an ensemble Kalman filter (Zhang et al. 2007; Zhang and Rosati 2010; Y. Chang et al. 2011b) and has been shown to produce a realistic mean ocean state and variability (Chang et al. 2013). Ensembles of 10 members are produced starting on 1 January every year from 1961 to 2011 and run for 10 years. Historical radiative forcing is used for the 1961–2005 period and the representative concentration pathway (RCP) 4.5 scenario for the predictions starting after 2005. A 10-member ensemble of uninitialized runs with the same forcings has also been produced to investigate the impact of initialization. This forecast suite is further discussed in A. Rosati et al. (2012, unpublished manuscript) and its retrospective skill in predicting Atlantic multidecadal oscillation (AMO)-like variability is described in Yang et al. (2012).

The DePreSys (Smith et al. 2007) is based on the Hadley Centre Coupled Model, version 3 (HadCM3; Gordon et al. 2000). The UKMO-DePreSys PPE (S10) is an updated version that uses a 9-member ensemble of model variants designed to sample model uncertainties through perturbations to poorly constrained atmospheric and surface parameters. Initial conditions are created by relaxing the model's components toward atmospheric [European Centre for Medium-Range Weather Forecasts (ECMWF) analysis and reanalysis] and oceanic (Smith and Murphy 2007) analyses, with values assimilated as anomalies with respect to the model climate. The purpose of anomaly assimilation is to minimize climate drift after the assimilation is switched off, but this does not totally suppress the bias, as discussed in Robson (2011). The 10-yr-long decadal retrospective forecasts consist of 9-member ensembles starting on 1 November every year from 1960 to 2005. A parallel set of nine uninitialized experiments using the DePreSys PPE is also used, and is referred to as the UKMO-DePreSys uninitialized forecast runs. The DePreSys experiments do not include future volcanic information, only volcanic aerosols from eruptions prior to the initialization; thus, each initial year has a unique suite of uninitialized experiments. We use the UKMO-DePreSys PPE data rather than the CMIP5 UKMO-DePreSys output in order to make a comparison with the results of S10.

We also perform a two-model average prediction by first running the statistical emulator on the output from each model, and then averaging the predictions of the two models. Previous experience with interannual hurricane forecasts indicates that a two-model average can have advantages over each individual model (Vecchi et al. 2011). Further work with the full suite of CMIP5 models is underway (Caron et al. 2013).

### c. Lead-dependent climatology

The statistical hurricane emulator is defined in terms of SST anomalies (SSTAs) with respect to the 1982–2005 climatology (Vecchi et al. 2011). The initialized and uninitialized model forecasts have their own climatology, which—for initialized forecasts using both models and for uninitialized forecasts using UKMO-DePreSys PPE—can depend on the lead time of the forecast. The uninitialized forecasts of DePreSys PPE have a lead-dependent climatology because the history of radiative forcing seen by forecasts verifying on the same year can depend on the initialization year: no "future" volcanic information is included in these uninitialized experiments. Therefore, we define a different climatology for each experiment (initialized and uninitialized) for each model (GFDL-DecPre and UKMO-DePreSys PPE). For the initialized model experiments we build a climatology that depends on lead time by averaging, for each lead time between 1 and 10 years, the forecasts that verify the years 1982–2005. We choose this as our reference period for two principal reasons: 1) the statistical model of Vecchi et al. (2011) was trained referenced to 1982–2005, and 2) it balances the competing desires to train over a period in which the observing system used to initialize the forecasts was relatively stable and to have a record long enough to faithfully define the model drift. Using other reference periods does not alter the principal results of this manuscript. To compute the model climatology we average all 10 ensemble members for GFDL-DecPre, but because UKMO-DePreSys PPE is a "perturbed physics ensemble," a different climatology is defined for each of its 9 ensemble members. Note that a key impact of subtracting the lead-dependent climatology is to remove a systematic bias that arises in the forecasts as the models drift toward their own mean state when initialized with observations (Stockdale 1997; ICPO 2011).
The drift of the models used here is toward each model's free running climatology, though even after 10 years there are regions where the initialized experiments have not yet settled at the free running climatology—these regions tend to roughly coincide with the regions where a potentially predictable decadal signal has been identified in the literature (e.g., Yang et al. 2012). A key assumption is that the systematic drift of the models does not depend on the initialization period—that is, that the systematic drift does not depend on the changes to the climate observing system that have occurred in the last 50 years. The stationary drift assumption has been shown to be problematic in interannual predictions, where change in observing system can modify the drift, and a suggested solution is to use different lead-dependent climatologies across major changes in observing system (e.g., Kumar et al. 2012). The assumption that the drift is stationary will be further discussed in section 4.
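The lead-dependent climatology described above amounts to averaging, separately at each lead, all forecasts that verify within the reference period. A minimal sketch, assuming a hypothetical dict-of-arrays layout for the ensemble-mean forecasts, is:

```python
import numpy as np

def lead_dependent_climatology(forecasts, ref_years=(1982, 2005)):
    """Drift-removing climatology: for each lead time, average every
    ensemble-mean forecast that verifies inside the reference period.

    `forecasts` maps start_year -> 1D array of annual ensemble means,
    with forecasts[s][k] verifying in calendar year s + k (lead k + 1).
    This layout is an assumption for illustration.
    """
    n_leads = len(next(iter(forecasts.values())))
    clim = np.empty(n_leads)
    for lead in range(n_leads):
        vals = [f[lead] for start, f in forecasts.items()
                if ref_years[0] <= start + lead <= ref_years[1]]
        clim[lead] = np.mean(vals)
    return clim

# Anomalies for a given start year are then forecasts[s] - clim, so the
# systematic lead-dependent drift is removed before applying the emulator.
```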

### d. Skill measures

We explore two statistical measures to quantitatively assess retrospective performance: the anomaly correlation coefficient (ACC) and the mean squared skill score (MSSS). These statistics are not independent, but offer slightly different views of the forecast model skill. The ACC is the sample correlation coefficient as a function of lead time *t* (or an average of lead times) between a set of forecast anomalies and observed anomalies over *j* = 1, … , *n* years after removing the mean of each:

ACC(*t*) = Σ_{j=1}^{n} *f* ′_{j}(*t*) *o*′_{j} / [Σ_{j=1}^{n} *f* ′_{j}(*t*)^{2} Σ_{j=1}^{n} *o*′_{j}^{2}]^{1/2},

where *f* ′_{j}(*t*) = *f*_{j}(*t*) − f̄(*t*) and *o*′_{j} = *o*_{j} − ō are the forecast and observed anomalies, and the overbar denotes the time mean over the climatological period 1982–2005, which for the forecasts is a function of lead time *t*. The ACC values can range from −1 to 1, and they measure the degree to which large positive and negative excursions from the mean co-occur in the forecast and verification.
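At a fixed lead time, the ACC reduces to a centered Pearson correlation over the verification years, which can be computed directly:

```python
import numpy as np

def anomaly_correlation(forecast, observed):
    """Anomaly correlation coefficient at one lead time: the centered
    sample correlation between forecast and observed anomalies."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    f -= f.mean()   # remove forecast mean over the verification years
    o -= o.mean()   # remove observed mean
    return float(np.sum(f * o) / np.sqrt(np.sum(f**2) * np.sum(o**2)))
```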

The root-mean-square error (RMSE) is often used as a measure of accuracy of the forecasts. It is defined as the square root of the mean square error (MSE):

MSE_{F} = (1/*n*) Σ_{j=1}^{n} [*f*_{j}(*t*) − *o*_{j}]^{2},  RMSE = (MSE_{F})^{1/2}.
We use here a related statistical measure, the mean squared skill score (Murphy 1988), following recommendations by Goddard et al. (2013). The MSSS is based on the MSE between the forecast and the observed climatology and represents the improvement in accuracy of the forecast over climatology:

MSSS = 1 − MSE_{F}/MSE_{C},

where MSE_{C} is the MSE of the climatological reference forecast. The highest MSSS value of 1 is reached when MSE_{F} = 0, and MSSS is negative when the forecast is less accurate than climatology.

Instead of using climatology as the reference forecast, one can use the MSE of the uninitialized projections (i.e., MSE_{P}) to evaluate the improved skill from initialization:

MSSS = 1 − MSE_{F}/MSE_{P},
where a positive MSSS indicates that the initialized forecasts outperform the uninitialized ones. MSSS can be expressed as a function of correlation and conditional bias (Goddard et al. 2013), which is useful when interpreting an improvement of skill from initialization.
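Both versions of the skill score share one form, differing only in the reference forecast (the climatological mean for MSE_{C}, the uninitialized ensemble for MSE_{P}); a sketch:

```python
import numpy as np

def msss(forecast, observed, reference):
    """Mean squared skill score of `forecast` relative to a `reference`
    forecast: 1 - MSE_F / MSE_R. The reference may be the climatological
    mean (repeated) or the uninitialized ensemble-mean series."""
    f, o, r = (np.asarray(a, dtype=float)
               for a in (forecast, observed, reference))
    mse_f = np.mean((f - o) ** 2)
    mse_r = np.mean((r - o) ** 2)
    return float(1.0 - mse_f / mse_r)
```

A perfect forecast gives MSSS = 1, a forecast matching the reference gives 0, and negative values indicate the forecast is less accurate than the reference.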

### e. Assessment of statistical significance

We explored three different estimates to assess statistical significance of the correlation results against a null of zero correlation, and to compute the confidence intervals of the retrospective correlations. For the estimates of statistical significance, the effective number of degrees of freedom *N*_{eff} of the correlation of two time series *X* and *Y* was computed using the methodology described in Bretherton et al. (1999), using the biased estimates of the autocorrelation spectrum of the various time series:

*N*_{eff} = *N* / [1 + 2 Σ_{τ=1}^{N} *ρ*_{XX}(τ) *ρ*_{YY}(τ)],

where *N* is the number of samples in each time series, and *ρ*_{XX}(τ) and *ρ*_{YY}(τ) are the estimates of autocorrelation of each time series at lag *τ*. Because of the large autocorrelation of the time-smoothed predicted and observed hurricane time series at even long lags, the effective degrees of freedom can be considerably smaller than the number of years in the time series. Typically, when compared with observations, the 5-yr mean initialized forecasts tend to have between 6 and 8 effective degrees of freedom and the uninitialized forecasts tend to have between 10 and 12 effective degrees of freedom, even though there are around 50 years of data that are compared. Without accounting for the strong autocorrelation in these time series, one would estimate much narrower confidence intervals and a smaller *p* value for the null hypothesis; failure to account for the diminished degrees of freedom can lead to a substantial overestimation of forecast skill.
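The Bretherton et al. (1999) adjustment can be computed directly from the biased sample autocorrelations; a minimal sketch:

```python
import numpy as np

def effective_dof(x, y):
    """Effective degrees of freedom for corr(x, y), following
    Bretherton et al. (1999): N is deflated by the lag-wise overlap of
    the two series' (biased) autocorrelation estimates."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)

    def acf(z, lag):
        # Biased autocorrelation estimate at the given lag.
        return np.dot(z[: n - lag], z[lag:]) / np.dot(z, z)

    overlap = sum(acf(x, t) * acf(y, t) for t in range(1, n))
    return n / (1.0 + 2.0 * overlap)
```

For strongly autocorrelated series such as 5-yr running means, the overlap term is large and the effective sample size falls well below the record length.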

Although hurricane frequency is not normally distributed, we are exploring multiyear averages of hurricane frequency, which allows us to approximate the distribution as normal. We use a two-sided test to compute confidence intervals on a correlation (because it is possible that initialization could degrade performance) and a one-sided test against the null hypothesis of zero correlation (because a significantly negative correlation would constitute a failure of the forecast system). We have compared the results from three methods:

- *Fisher's z transformation*: The sample estimate of the correlation coefficient *r*_{X,Y} between two time series *X* and *Y* is transformed using

  *Z*_{X,Y} = (1/2) ln[(1 + *r*_{X,Y})/(1 − *r*_{X,Y})].

  The new quantity *Z*_{X,Y} follows a *z* distribution with *N*_{eff} − 3 degrees of freedom (Fisher 1915, 1924; von Storch and Zwiers 1999). Using standard *z*-statistic tables, one can estimate the confidence intervals on the mean and test against a null hypothesis of zero mean from the sample estimate *Z*_{X,Y}. To transform the confidence interval estimates of the *z* statistic back to correlation space, we employ the inverse Fisher's *z* transformation, *r̂* = [exp(2*Ẑ*) − 1]/[exp(2*Ẑ*) + 1], where *Ẑ* is the estimate of the upper or lower bound on the confidence interval of the *z* statistic and *r̂* is the corresponding bound on the confidence interval of the correlation coefficient.
- *Full distribution of the correlation coefficient*: Johnson et al. (1995) provide the distribution of the sample correlation coefficient *R* when the population correlation coefficient *ρ* is equal to zero:

  *p*_{R}(*r*) = Γ[(*n* − 1)/2] / {√π Γ[(*n* − 2)/2]} (1 − *r*^{2})^{(*n*−4)/2},

  where Γ(·) is the gamma function and *n* is the sample size. This distribution is symmetric around zero. Using *p*_{R}, we can test the null hypothesis of no correlation at a given significance level *α* by checking whether the sample correlation coefficient lies within or outside the rejection (critical) region.
- *Monte Carlo estimate*: For sample sizes ranging between 2 and 100, we build 100 000 estimates of the distribution of the sample correlation coefficient between two normally distributed time series of length *N*_{eff} and an underlying correlation *ρ*. We sample underlying correlation coefficients between −1 and 1, at intervals of 0.01. From this Monte Carlo estimate of the probability density function (PDF) of the sample correlation coefficient, we estimate significance against a null of zero correlation as the probability of a correlation as large as or larger than a particular sample correlation given an underlying correlation of zero. In an analogous manner, we also compute the confidence intervals on the sample correlation given an underlying correlation.

We have compared the three estimates of the confidence intervals on the correlation coefficient and null test against a correlation of zero for the retrospective forecast correlations, and have found that they are consistent with each other. For simplicity, in the manuscript we only show the estimates from the Fisher's *z* transformation.
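The Fisher's *z* interval can be reproduced in a few lines; the critical value of 1.645 below (an approximate 90% two-sided interval) is an illustrative choice:

```python
import math

def fisher_ci(r, n_eff, z_crit=1.645):
    """Confidence interval on a correlation via Fisher's z transform,
    using effective (not nominal) degrees of freedom. z_crit = 1.645
    gives an approximate 90% two-sided interval."""
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))   # Fisher's z transform
    se = 1.0 / math.sqrt(n_eff - 3.0)           # standard error of z
    # Inverse transform: tanh(z) = (exp(2z) - 1) / (exp(2z) + 1).
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

For example, with a sample correlation of 0.8 and *N*_{eff} = 8 (typical of the 5-yr mean initialized forecasts), the 90% interval spans roughly 0.35 to 0.95, which illustrates why nominally large differences in correlation between experiments are difficult to establish as significant with this record.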

## 3. Results

### a. Retrospective hurricane forecasts

Figure 1 shows the 5- and 9-yr mean (centered on the midpoint of each interval) initialized and uninitialized forecasts of North Atlantic hurricane frequency in GFDL-DecPre and UKMO-DePreSys PPE compared with observations. The observed record of 5-yr mean hurricane frequency is largely characterized by two distinct states with low values (~5–6 hurricanes per year) in the first half of the record and a shift in the mid-1990s (e.g., Elsner et al. 2004; Li and Lund 2012) toward a more active state (~8 hurricanes per year). The uninitialized predictions capture a tendency for an increase in hurricane frequency over the late twentieth century, indicating that part of the recent increase in Atlantic hurricane frequency was caused by changes in radiative forcing, consistent with other recent findings (e.g., S10; Villarini and Vecchi 2012b, 2013a). However, the uninitialized experiments fail to capture the abrupt shift in the mid-1990s. The initialized retrospective forecasts show better qualitative agreement with observations than do the uninitialized runs, suggesting an improvement from initialization.

Despite the time averaging, both observations and the model predictions have year-to-year variability in 5-yr North Atlantic hurricane frequency, which complicates detection of decadal changes (Fig. 1). The year-to-year variations in the multiyear initialized forecasts are larger than those in observations, even though the forecasts are ensemble averages. This result is particularly striking given that the statistical emulator should only recover a fraction of the observed variance, and suggests that the initialized forecasts have too much internal variability. An alternative interpretation, which is discussed further in section 3c below, is that the initial conditions for each year's initialization persist too strongly, so that each initialization year's climate reflects the average of multiple subsequent years.

The anomaly correlation between the observed hurricane counts and the model predictions for both initialized and uninitialized experiments is shown in Fig. 2 for 5- and 9-yr means. A persistence forecast is given as a reference test forecast, where the 5-yr (9-yr) mean persistence is defined as the observed average over the 5 (9) years that precede the model's initialization (persisting the SSTA indices does not improve the performance of the persistence null model, with correlations ranging between 0.16 and 0.4 depending on the SST dataset used). So, for example, the persistence forecast for the lead 2–6 forecast centered in 1992 (i.e., initialized in 1989) is the observed hurricane count averaged over 1984–88. Consistent with Fig. 1, at lead 2–6 the initialized retrospective predictions show higher correlations than the uninitialized ones, for both models. The values are significantly different from zero and exceed the values given by persistence, which is not the case for the uninitialized predictions. Comparable skill is found between the two models, slightly higher in UKMO-DePreSys; these retrospective correlations are comparable to those reported in S10 using an alternative methodology applied to DePreSys PPE. Computing the two-model mean increases the signal-to-noise ratio, leading to higher correlations than in either individual model. At lead 2–10, all the predictions outperform the persistence forecast. The decadal correlations are nominally higher in the initialized retrospective predictions than in the uninitialized, with the largest values, exceeding 0.8, occurring when taking the two-model mean. This decadal skill does not come only from the first few years because the correlations at lead 6–10 are also large (Fig. 2), although it is ambiguous whether there is improvement from initialization.
At lead 6–10, GFDL-DecPre shows larger correlations for the initialized predictions but UKMO-DePreSys indicates higher values for the uninitialized runs, yielding indistinguishable values between the initialized and noninitialized experiments for the two-model mean.

These results suggest that coupled GCMs that account for both changes in initial state and radiative forcings can lead to skillful multiyear retrospective predictions of hurricane frequency. The nominal improvement from initialization should, however, be interpreted with care given the large confidence intervals associated with the point estimates of the correlations (Fig. 2). As discussed above in section 2e, although the observed record is 50 years long, because of the large autocorrelation of the time series each year is not independent from those nearby. Hence, the effective number of degrees of freedom is greatly reduced, to less than 10 for most lead times, as indicated in Fig. 2, based on Bretherton et al. (1999). Therefore, even if the initialized predictions give a correlation that is statistically different from zero and is nominally higher than in the uninitialized predictions, the large confidence intervals indicate that the retrospective correlation of the initialized forecasts is not different from persistence or the uninitialized experiments at *p* = 0.1. Some of the correlations of the initialized forecasts are significantly larger than the noninitialized experiments at *p* = 0.2.

The nonsignificance of the difference between the initialized and noninitialized correlations does not depend strongly on the effective sample size, as long as some level of autocorrelation is assumed. We recomputed the confidence intervals on the sample correlations using an unrealistically generous assumption that only two years were needed for each new degree of freedom, and the initialized to uninitialized correlation differences were still not significantly different at *p* = 0.1. If we assume, even more unrealistically, that a new degree of freedom is achieved every 1.5 years, then the differences between the initialized and uninitialized experiments are significant at *p* = 0.1. However, we wish to stress that these perturbation experiments yield unrealistically high estimates of the number of degrees of freedom, given that we are exploring 5-yr running averages of quantities with a pronounced trend and interdecadal variation. The record is too short, and the difference between initialized and uninitialized correlations too small, to yield a statistically significant difference.

The improvement from initialization in the two-model mean lead 2–6 forecast is close to significant even at *p* = 0.1, suggesting potentially higher confidence in multimodel ensembles. For the lead 2–6 and 2–10 forecasts, both model systems show a consistent nominal improvement of retrospective correlation from initialization relative to the uninitialized experiments. Because of this, and because of the small sample size, we speculate that the lack of significance at *p* = 0.1 may reflect a "lack of power" of the significance test, rather than a "lack of effect" from initialization (Johnson 1999). For the lead 6–10 forecast, however, the nominal difference between the initialized and noninitialized forecasts changes sign (there is a nominal indication of improvement in GFDL-DecPre, but a nominal degradation in UKMO-DePreSys PPE), so we interpret the lack of significance in this case as indicating a lack of effect from initialization. Therefore, it appears that the nominal improvement in the lead 2–10 forecast arises in the first part of the decade and represents potential multiyear forecast skill rather than decadal skill.

A lagged-ensemble approach, in which past forecasts are used to augment the effective ensemble size of more recent forecasts (e.g., by averaging the current year's lead 1–5 and the previous year's lead 2–6 forecasts), can improve forecast performance [e.g., Vecchi et al. (2011) showed improvement in interannual hurricane forecasts from lagged ensembles]. We explored the impact of lagged ensembles on the retrospective hurricane forecasts (not shown) at lags of up to three years (i.e., averaging lead 1–5, 2–6, and 3–7 forecasts verifying over the same years), finding nominal improvements in the correlation coefficient (on the order of 0.02–0.05). However, the smoothing induced by the lagged ensemble led to a further reduction of degrees of freedom. Because the uncertainty in a correlation estimate increases with decreasing correlation or sample size, the uncertainty estimates on the correlation coefficient did not change substantially: even after lagged-ensemble averaging, the retrospective correlations of the uninitialized and initialized forecasts remained within each other's confidence intervals.
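The lagged-ensemble construction described above can be sketched as follows; this is a toy illustration, with the data structure and forecast values invented for the example:

```python
def lagged_ensemble(fcst, init_year, lead_start, n_lags):
    """Average a forecast with up to n_lags earlier-initialized forecasts
    verifying the same 5-yr window, e.g., this year's lead 1-5 with last
    year's lead 2-6. fcst maps (init_year, lead_start) -> predicted count."""
    members = []
    for lag in range(n_lags + 1):
        key = (init_year - lag, lead_start + lag)
        if key in fcst:                 # skip initialization years with no forecast
            members.append(fcst[key])
    return sum(members) / len(members)

# toy forecasts, all verifying over 1996-2000
fcst = {(1995, 1): 7.0, (1994, 2): 8.0, (1993, 3): 9.0}
blend = lagged_ensemble(fcst, 1995, 1, n_lags=2)   # (7 + 8 + 9) / 3
```

The averaging over forecasts that verify over the same window is what enlarges the effective ensemble, at the cost of additional smoothing.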

As a complement to the skill estimates based on the ACC, Fig. 3 shows the MSSS for various 5- and 9-yr mean leads. The improvement relative to climatology [Eq. (4)] and that from initialization [Eq. (5)] are indicated on the *x* and *y* axes, respectively. None of the retrospective initialized forecasts has a negative MSSS on the *x* axis, indicating at least a nominal improvement relative to climatology. An improvement from initialization is also suggested at all leads in GFDL-DecPre, and at all leads except 5–9 and 6–10 in UKMO-DePreSys PPE, leading to a smaller MSSS at those lead times for the two-model mean. Both models indicate improved skill at the decadal scale from initialization, with the highest values in UKMO-DePreSys. As shown in Goddard et al. (2013), the MSSS is a function of both the correlation and the conditional bias, and the higher MSSS from initialization is mainly caused by a reduction of the conditional bias, which is large in the uninitialized predictions.
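The MSSS has the generic form of one minus a ratio of mean-square errors; the following minimal sketch uses our own function names and toy data, with climatology standing in for the reference forecast of Eq. (4):

```python
def mse(pred, obs):
    """Mean-square error of a prediction sequence against observations."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def msss(forecast, reference, obs):
    """Mean-square skill score: 1 - MSE(forecast)/MSE(reference).
    With reference = climatology this mirrors Eq. (4); with reference =
    the uninitialized forecast it mirrors Eq. (5). Positive values mean
    the forecast beats the reference."""
    return 1.0 - mse(forecast, obs) / mse(reference, obs)

# toy observed 5-yr mean counts and a forecast to score
obs = [5.0, 7.0, 9.0, 6.0]
climo = [sum(obs) / len(obs)] * len(obs)   # climatological reference
fcst = [5.5, 7.5, 8.0, 6.0]
skill = msss(fcst, climo, obs)             # > 0: beats climatology
```

Because the score is built from squared errors, it penalizes conditional bias as well as phase errors, which is why the correlation and MSSS views (Figs. 2 and 3) can differ.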

### b. SST source of hurricane forecast skill

Our hurricane frequency index is based on SST averaged over the tropical Atlantic and over the global tropics [Eq. (1)], so both quantities are potential sources of the better predictability in the initialized forecasts. We can explore retrospective forecasts and skill measures of these two indices to assess the role each plays in recovering the past history of hurricane activity (Fig. 4). Overall, there is no indication that retrospective forecasts of tropical-mean SST are improved by initializing the coupled GCMs (Fig. 4, top), with the relatively monotonic warming of the tropics dominating the observed and modeled signals. The dominance of the long-term trend in both SST indices cuts the effective degrees of freedom severely, to the point where for tropical-mean SST the interpretation of correlation as a skill metric is likely too ambiguous to be useful. The GFDL-DecPre system has marginally higher retrospective correlation in both SST indices than does UKMO-DePreSys, likely because of the inclusion of future volcanic information in its radiative forcing (Table 1). However, this nominally larger skill in GFDL-DecPre for the two SST indices does not translate into even a nominal increase in the skill of the hurricane forecasts (Fig. 2), because the volcanic signals are primarily spatially uniform. Across both model systems there is a consistent nominal improvement of retrospective correlation of Atlantic main development region (MDR) SST predictions from initialization, but the effect is small relative to the number of degrees of freedom. Only in GFDL-DecPre does the initialized forecast of MDR SST approach a significant improvement over a persistence forecast. Because of the dominance of a quasi-monotonic trend, for tropical-mean SST all the forecast methods (initialized and uninitialized GCM forecasts and persistence) yield comparable results. For both SST indices, all of the forecast methodologies lead to retrospective correlations that are statistically significant against a null hypothesis of zero correlation, again largely because of the dominance of the trend.

The results in Fig. 4 suggest that the nominal improvement in retrospective correlation from initialization comes from improvements to the forecast of Atlantic MDR SST. However, because the time series of each SST index includes a substantial component that is coherent across both indices, and because the hurricane frequency emulator is based on the difference between the two indices, attributing the source of hurricane predictability to either index is not necessarily straightforward, as was noted in Vecchi et al. (2011). An alternative way to assess how the initialization of each index affects forecast skill is to use values of one index from the initialized experiments and the other from the uninitialized experiments. For example, taking values of SST_{MDR} from the initialized experiment, but keeping SST_{TROP} from the uninitialized one, yields hurricane retrospective forecast results (Fig. 5a) comparable to those obtained when both indices are taken from the initialized experiments (Fig. 2). The impact of initialization on SST_{MDR} yields 5-yr mean fluctuations of the hurricane frequency index that agree rather well with observations for both models, with correlations of 0.70 and 0.59 in GFDL-DecPre and UKMO-DePreSys, respectively (both significantly different from zero correlation at *p* < 0.05) at lead 2–6. Using values of SST_{MDR} from the uninitialized experiments but those of SST_{TROP} from the initialized experiments leads to very different results (Fig. 5b). The correlation drops to 0.21 in GFDL-DecPre and to 0.43 in UKMO-DePreSys, with neither correlation significantly different from *ρ* = 0 (even at *p* < 0.2) and neither model able to reproduce the observed sharp increase in the mid-1990s. This indicates that the nominal improvement in correlation in the initialized multiyear predictions results from a better representation of Atlantic MDR SST when the coupled models are initialized, with little beneficial impact from initialized predictions of tropical-mean SST.
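The swap experiment can be illustrated with a toy emulator of the form implied by Eq. (1), in which the expected count rises with MDR SST and falls with tropical-mean SST. The coefficients and SST anomalies below are illustrative placeholders, not the fitted values of the actual statistical model:

```python
import math

# Toy Poisson-rate hurricane emulator in the spirit of Eq. (1).
# B_MDR = -B_TROP, so only the MDR-minus-tropics difference matters.
B0, B_MDR, B_TROP = 1.7, 1.4, -1.4   # placeholder coefficients

def expected_count(sst_mdr, sst_trop):
    """Expected seasonal hurricane count given SST anomalies (K)."""
    return math.exp(B0 + B_MDR * sst_mdr + B_TROP * sst_trop)

# hypothetical anomalies: MDR SST from the initialized run,
# tropical-mean SST from the uninitialized run
sst_mdr_init, sst_trop_uninit = 0.3, 0.1
mixed = expected_count(sst_mdr_init, sst_trop_uninit)
warm_everywhere = expected_count(0.3, 0.3)   # spatially uniform warming
```

Because only the difference between the two indices enters, spatially uniform signals cancel, consistent with the lack of hurricane-skill benefit from the largely uniform volcanic signal in GFDL-DecPre noted above.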

For the GFDL-DecPre system, the difference in retrospective correlation when swapping initialized/uninitialized SST_{MDR} and SST_{TROP} is significant at *p* < 0.1. Note in Fig. 5b that there is a large increase in hurricane frequency around 2005 in GFDL-DecPre, as in Fig. 1a. This increase, which we currently consider to be spurious, is a large contributor to the reduction in correlation from the impact of initialization on tropical-mean SST in the GFDL model. There is a coincidence between the global implementation of Array for Real-Time Geostrophic Oceanography (Argo) profiling floats in 2003 and the spurious shift of 9-yr forecasts centered around 2005/06, suggesting that enhanced observational sampling after 2003 may have led to a change in the lead-dependent climatology. Experiments are underway to test this possibility. The lack of such a spurious increase in UKMO-DePreSys could arise from different initialization processes, or from the fact that the last initialized forecast in UKMO-DePreSys begins in 2006, so the late spike would not be evident. Were the introduction of Argo found to be the driver of this spurious increase, then, in addition to developing methods to minimize the impact of observing system changes, the impact of other large changes to the observing system would also have to be explored (e.g., the introduction of altimetry in the early 1990s and the completion of the TAO array in the mid-1990s).

### c. Role of the mid-1990s climate shift

The nominal improvement in skill from initialization should be interpreted with care. Even if the initialized retrospective predictions outperform climatology at almost all lead times (Fig. 3), the skill could still come from persistence, just persistence that cannot be captured with our observationally based persistence model. Figures 6a and 6b compare the retrospective predictions of hurricane frequency for 5-yr means ranging between leads 1–5 and 6–10. The forecasts at each lead tend to show a systematic 1-yr shift with respect to the preceding lead, with the mid-1990s shift in each model trailing in time at longer leads rather than capturing the observed 1995 shift (e.g., Elsner et al. 2004; Li and Lund 2012) at the right time. Changepoint analysis (Pettitt test) applied to the models' retrospective predictions identifies a shift in forecasts initialized in 1991 in UKMO-DePreSys and in forecasts initialized in 1995 in GFDL-DecPre. This tendency for forecasts to carry their initialized state across the shift can be seen more clearly when the same time series are plotted as a function of initialization year instead of verification time (Figs. 6c,d): forecasts initialized in the same year are very similar to each other, independent of when they verify. Notice that the mid-1990s shift for each model appears at the same initialization year for all lead times, as does the potentially spurious mid-2000s shift in GFDL-DecPre.
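The Pettitt test used on the forecast series is a standard nonparametric changepoint test; the following is a compact sketch of our own implementation, adequate for short series like these:

```python
import math

def pettitt(x):
    """Pettitt (1979) nonparametric changepoint test: returns the index
    splitting the series into two regimes and an approximate p-value."""
    n = len(x)
    best_u, best_t = 0, 0
    for t in range(1, n):
        # U_t sums sign(x_j - x_i) over all pairs straddling the split t
        u = sum((xj > xi) - (xj < xi) for xi in x[:t] for xj in x[t:])
        if abs(u) > abs(best_u):
            best_u, best_t = u, t
    # standard large-sample approximation for the two-sided p-value
    p = min(1.0, 2.0 * math.exp(-6.0 * best_u ** 2 / (n ** 3 + n ** 2)))
    return best_t, p

# step-like toy series: low regime, then a shift to a high regime
series = [5, 6, 5, 4, 6, 5, 5, 6, 4, 5,
          9, 10, 8, 9, 11, 10, 9, 10, 11, 9]
t, p = pettitt(series)
```

The O(n²) pair sum is fine for multidecadal annual records; for long series the rank-based recursion is preferable.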

Up to now we have been largely comparing the results of forecasts initialized at different years at the same lead, without focusing on the evolution of hurricane counts of each forecast as the lead increases. A correct forecast of the mid-1990s climate shift would have indicated at some point prior to the shift that there was an increased probability of hurricane frequency increasing in time. For example, if a forecast initialized in early 1991 showed counts averaged in 1992–96 that were larger than those in 1991, or an increased number of ensemble members with large increases, one would have evidence for a future shift. Do these two forecast systems produce such a shift? Figure 7 shows that in the observational record, reflecting the rapid increase in frequency in 1995, the difference in hurricane counts averaged over the five years following the years 1991–94 exceeded the counts over each of those years by an unusually large amount, relative to the distribution over the 1961–2006 period. However, neither forecast system (colored lines in Fig. 7) shows a tendency for their forecasts to increase in time relative to the first forecast year when initialized in the early 1990s. In fact, there is a nominal tendency for these forecasts to decrease in time from the first forecast year, relative to the distribution of tendencies across all initialization dates, 1961–2006. That is, the models did not forecast a tendency toward higher frequency in the mid-1990s (Fig. 7), even though the sequence of forecast values exhibits a climate shift in the mid-1990s (Figs. 1 and 6).

To further highlight the influence of the mid-1990s shift on the retrospective skill estimates, we explore forecast performance after removing the mid-1990s shift from both the forecasts and the observations. The shift is "removed" by simply referencing each period before and after the 1994/95 shift to its own climatology; for instance, the time-mean hurricane count preceding 1995 is removed from all years before 1995, and the time-mean hurricane count following 1995 is removed from all years after 1995. Using each model's changepoint instead of 1995 does not affect the character of the results. Figures 8 and 9 indicate that removing the shift leads to a substantial reduction of correlation in the initialized predictions at lead 2–6 (particularly for UKMO-DePreSys PPE), and to no indication of skill beyond that lead time, further confirming that the decadal signal is dominated by the trend arising from the mid-1990s changepoint. Therefore, future real-time multiyear and decadal predictions of hurricane frequency (as opposed to the retrospective forecasts explored here) should not be expected to show the same skill as over the 1961–2011 period unless changepoints of similar character to the mid-1990s shift occur. Our results are encouraging for the feasibility of multiyear forecasts of hurricane frequency with the current prediction systems. However, this analysis highlights that substantial challenges remain, or, viewed more optimistically, that it is possible to improve the performance of the system beyond its current capability.
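The shift-removal procedure, referencing each segment to its own climatology, amounts to piecewise de-meaning; a minimal sketch with invented counts:

```python
def remove_shift(values, years, change_year=1995):
    """Reference each segment before/after the changepoint to its own
    time mean, so the step change no longer contributes to skill."""
    before = [v for v, y in zip(values, years) if y < change_year]
    after = [v for v, y in zip(values, years) if y >= change_year]
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return [v - (mean_before if y < change_year else mean_after)
            for v, y in zip(values, years)]

# toy example: low regime 1991-94, high regime 1995-98
years = list(range(1991, 1999))
counts = [5.0, 4.0, 6.0, 5.0, 9.0, 10.0, 8.0, 9.0]
anoms = remove_shift(counts, years)   # each segment now averages to zero
```

After this referencing, any remaining correlation between forecasts and observations must come from variability within each regime rather than from the step itself.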

An interesting side effect of removing the mid-1990s shift is to increase the effective degrees of freedom, narrowing the confidence intervals associated with the point estimates of the correlation coefficient (cf. Figs. 2 and 9). In addition, the retrospective correlation of the uninitialized forecasts disappears once the changepoint is removed, because it largely arose from the projection of the observed shift onto the models' forced trend over this period. In this modified context, there are now indications that for the GFDL model and the two-model ensemble the correlations (although lower than in the case including the shift; Fig. 2) are significantly higher than those of the uninitialized versions of the model at lead 2–6. That is, there is significant (at *p* < 0.1) indication that GFDL-DecPre and the two-model ensemble may be able to predict the types of variations in hurricane frequency that occurred in the early 1980s and early 1990s better than the uninitialized experiments. In Fig. 2, the nominal improvement from initialization in the correlation of the lead 2–6 and 6–10 mean hurricane counts in GFDL CM2.1 was larger than that for the lead 2–10 forecasts; this may reflect the ability of GFDL CM2.1 to retrospectively forecast some multiyear variations beyond the 1994/95 climate shift, which is the dominant signal in the 9-yr running counts. This further highlights the limitation of a data record that is short relative to the dominant time scales for assessing multiyear forecast skill. While it is entirely possible that some of the nonsignificant differences between the initialized and uninitialized models shown in Figs. 2 and 3 could become significant with a longer record, it is also possible that the impact of initialization could decrease and remain nonsignificant in a longer record.

## 4. Summary and discussion

Predictions of North Atlantic hurricane frequency were investigated in two global coupled models initialized toward estimates of the observed climate state. We find statistically significant retrospective correlations for multiyear-to-decadal initialized hurricane frequency forecasts when accounting for both initialization and radiative forcing changes. The two systems explored, GFDL-DecPre and UKMO-DePreSys PPE, show comparable skill. The two-model mean had the highest skill, encouraging the pursuit of broader multimodel studies (e.g., Caron et al. 2013); lagged-ensemble averages lead to nominal correlation increases. The retrospective correlations from initialized multiyear hurricane forecasts are comparable to those reported in S10 using an alternative methodology.

Taken together, our results and those of S10 indicate that initializing a climate model and accounting for radiative forcing changes, together, can lead to significant retrospective skill in multiyear hurricane forecasts (relative to climatological forecasts). The performance of the initialized forecasts was nominally better than that of uninitialized forecasts, both in correlation and in MSSS (Goddard et al. 2013). However, because of the short observational record and the persistent character of the time series, the confidence intervals associated with all the forecasts are large, and the difference between initialized and uninitialized forecasts is not statistically significant at *p* = 0.1 (although some are at *p* = 0.2). Because of the consistency of correlations across studies and the visual improvement, we hypothesize that the lack of significant improvement from initialization may indicate a lack of "power" (i.e., the probability that the test will correctly reject the null hypothesis) of the statistical test (arising from too few degrees of freedom and a relatively strong correlation arising from radiative forcing alone) rather than a lack of effect of initialization (e.g., Johnson 1999). Additional years could enhance our confidence; however, the large autocorrelation of the time series indicates that roughly seven years of data are required to gain a degree of freedom, so many years will be needed to improve our confidence, even if we include the past 50 years in future estimates of forecast skill.

The observed time series of North Atlantic hurricane frequency is dominated by a strong and abrupt rise in 1995 leading to a trend over the 1961–2011 period. The high correlations of the retrospective predictions of North Atlantic hurricane frequency depend on the presence of this shift. While predictions from both models are for more hurricanes after the mid-1990s than before, the increase is not actually predicted by the evolution of the models, but is present in the initial state (i.e., forecasts initialized after the shift exhibited by each model remain high, but those initialized prior do not show the shift; Figs. 6 and 7). That is, the large retrospective skill estimates (Figs. 2 and 3) do not come from predicting the dynamical evolution of the climate system resulting in the hurricane frequency shift, but from “recognizing” that a climate shift has occurred and persisting that shift. This behavior mirrors experience in seasonal forecasts of El Niño, where transitions from climatological conditions to a warm ENSO state can be problematic to predict (e.g., Landsea and Knaff 2000; Vecchi et al. 2006), and successful forecasts often reflect the continued updating of subsurface conditions. This reduces our confidence that the onset of a similar shift in the near future could be successfully predicted with current prediction systems. It also highlights the need to better understand the origin of the changepoint in the observations and to assess whether the modeled mechanisms are consistent with those in the real world (e.g., Robson et al. 2012).

Despite high correlation values, the mean retrospective skill of these forecasts may provide a poor and even misleading guide to future performance. In the absence of a major climate shift like the 1994/95 shift, the long-term estimates of correlation (e.g., 0.6–0.9) are not representative, and the lower retrospective correlations assessed after removing the shift (e.g., 0–0.4; Figs. 8 and 9) may be closer to what one should expect.

Neither model system successfully predicts the highest values of observed 5-yr hurricane frequency that appear in the mid-2000s. GFDL-DecPre shows a comparable rise but 5–10 years later than observed, whereas UKMO-DePreSys shows a more modest increase with a several-year delay as well. Forecasts with GFDL-DecPre that extend past the present suggest an increase in hurricane frequency through the mid-2010s (Fig. 1). However, observations have been tending in the opposite direction, with recent years being less active than those in the mid-2000s. This period coincides with a fundamental change in the ocean observing system, with the global introduction of Argo floats after 2003 bringing considerably better coverage of the surface and subsurface ocean. Changes in observing systems have previously impacted the behavior of initialized forecasts, in part by changing the character of the initialized model's drift (e.g., Kumar et al. 2012); therefore the introduction of Argo could impact the lead-dependent climatology.

Thus, we hypothesize that this increase predicted by GFDL-DecPre is spurious and reflects the impact of Argo data on the GFDL-DecPre drift. To test this hypothesis, a set of experiments was performed in which Argo data were withheld from the initialization scheme of GFDL-DecPre after 2004. The predicted abrupt increase after 2004 is severely reduced when Argo is removed (Fig. 10), largely because of changes to model drift in regions that were poorly observed prior to Argo. These experiments support our hypothesis, so a more plausible prediction for the coming years is that shown in the left panel of Fig. 5, in which there is a tendency for relative stability or a reduction of hurricane frequency in the coming years. Changes in drift (lead-dependent climatology) arising from the introduction of Argo impact the character of predictions of tropical- and global-mean temperature in the GFDL-DecPre system, leading to spuriously cold predictions of both if a single lead-dependent climatology is used to analyze the pre- and post-Argo periods. We speculate that related errors may arise in other prediction systems from observing system changes. Methodologies to deal with the impact of observing system changes on drift must be developed in order to fully realize the potential of multiyear predictions; as the post-Argo record lengthens, a potential solution, motivated by Kumar et al. (2012), is to use different lead-dependent climatologies for the pre- and post-Argo periods. In addition, the impact of other observing system changes bears exploration, such as the introduction of the Pacific Tropical Atmosphere–Ocean moored buoy array in the early 1990s (McPhaden 1993) and of expendable bathythermographs in the late 1960s. Interpretation of forecasts needs to be carefully constrained by our knowledge of changing observing practices, both in the predictands (e.g., Vecchi and Knutson 2008, 2011; Landsea et al. 2010; Villarini et al. 2011a) and in the observations used to initialize the climate model (e.g., Zhang et al. 2007; Kumar et al. 2012).

Identifying the source of skill in retrospective predictions is key to the success of future forecasts. Recent studies (Mann and Emanuel 2006; Evan et al. 2009; S10; Villarini and Vecchi 2012b, 2013a) have argued that the recent (since the 1980s) increase of Atlantic hurricane activity was not caused by internal variability alone but also included an externally forced component driven largely by changing aerosol concentrations. Our results partially support this interpretation, indicating high correlations (significant at lead 2–10) in the uninitialized forecasts. Yet the sharp mid-1990s increase in Atlantic hurricane frequency is not retrospectively predicted in the uninitialized experiments. Its better representation in the initialized predictions could be interpreted as an indication of a key role for internal variability in the mid-1990s shift, supporting various studies (e.g., Zhang and Delworth 2005, 2006, 2009; Robson et al. 2012; Yeager et al. 2012; R. Msadek et al. 2013, unpublished manuscript). However, the nominal improvement from initialization could also reflect a failure in the radiative forcing/response in these models that is corrected when they are constrained with observations.

Our results indicate that the impact of initialization on forecasts of the Atlantic MDR relative to the tropics was key to the higher skill in the initialized forecasts (Figs. 4 and 5). Zhang and Delworth (2006) suggested that multiyear changes in hurricane activity could be driven by changes to the heat transport over the entire North Atlantic. S10 and Dunstone et al. (2011) further suggested that the subpolar North Atlantic was the main source of multiyear predictability of Atlantic hurricane frequency. The North Atlantic also stands out as the region where initialized forecasts outperform uninitialized ones in the GFDL model (A. Rosati et al. 2012, unpublished manuscript; Yang et al. 2012; R. Msadek et al. 2013, unpublished manuscript), suggesting a potential link between North Atlantic variability and Atlantic hurricane predictability in GFDL-DecPre. Further, Kang et al. (2008) showed that changes in the North Atlantic could lead to changes in atmospheric circulation over the tropical Atlantic in GFDL CM2.1. However, in our retrospective forecasts of hurricane activity, the relevant source of skill must have been present in tropical Atlantic SST—so any role for extratropical forcing must involve a subsequent change to tropical Atlantic SST. Thus, improved representation of processes controlling tropical Atlantic climate (e.g., Doi et al. 2012) is key to enhanced skill in forecasts of hurricane activity by systems like those used here.

## Acknowledgments

We are grateful to Doug Smith (UK Met Office) for making the UKMO-DePreSys PPE data available. We thank Ming Zhao and Tom Knutson for comments and suggestions.

## REFERENCES

Johnson, N. L., S. Kotz, and N. Balakrishnan, 1995: *Continuous Univariate Distributions.* Vol. 2. Wiley, 752 pp.

Knutson, T. R., and R. E. Tuleya, 2004: Impact of CO_{2}-induced warming on simulated hurricane intensity and precipitation: Sensitivity to the choice of climate model and convective parameterization. *J. Climate,* **17,** 3477–3495.


von Storch, H., and F. W. Zwiers, 1999: *Statistical Analysis in Climate Research.* Cambridge University Press, 484 pp.

Held, I. M., and M. Zhao, 2011: The response of tropical cyclone statistics to an increase in CO_{2} with fixed sea surface temperatures. *J. Climate,* **24,** 5353–5364.

## Footnotes

This article is included in the North American Climate in CMIP5 Experiments special collection.