## 1. Introduction

Despite having advanced data assimilation systems for initialization, fully coupled ocean–atmosphere–sea ice dynamical models solving the physical equations of the climate system are generally known to be only slightly more skillful than statistical models in forecasting El Niño–Southern Oscillation (ENSO) phase and intensity (Balmaseda and Anderson 2009). In particular, general circulation models (GCMs) have problems in predicting boreal winter tropical Pacific sea surface temperature (SST) for forecasts starting in boreal spring (February–May); that is, they exhibit the so-called boreal spring predictability barrier (Flügel and Chang 1998; Jin et al. 2008; Philander et al. 1984). During the boreal spring, the intertropical convergence zone (ITCZ) is typically situated close to the equator with the climatological (seasonal) SST at a maximum (Lai et al. 2018). The combined effect is that ENSO SST anomalies (SSTA) and associated sea level pressure anomalies, measured by the Southern Oscillation index (SOI), are weakest during boreal spring, thereby reducing the signal-to-noise ratio and making forecasts more sensitive to random variability. During boreal spring, ENSO events are typically in a decaying phase with relatively weak zonal SST gradients, and thus small perturbations in SST can be amplified over a substantial region of the equatorial Pacific, making associated SSTA difficult to detect, let alone to forecast accurately (Jin et al. 2008).

Since 2002, routine ENSO forecasts, largely in terms of indices associated with the variability of SST anomalies in the equatorial Pacific, have been collated by the International Research Institute for Climate and Society (IRI) and published on their web page (http://iri.columbia.edu/climate/ENSO/currentinfo/SST_table.html). Both Barnston et al. (2012) and Tippett et al. (2012) reviewed the performance of the constituent models comprising the IRI dataset over the 2002–11 period. They found the highest skill for those forecasts initiated after the boreal spring predictability barrier for target seasons prior to the subsequent boreal spring; that is, forecasts were found to verify systematically better against observations at lead times earlier than the intended forecast targets. The IRI dataset has more recently been augmented by hindcasts (1982–2010) and real-time (2011–15) predictions generated as part of the North American Multimodel Ensemble (NMME) project (Kirtman et al. 2014). Barnston et al. (2017) assessed skill in the NMME in terms of the mean square error skill score and anomaly correlation. In a companion study, Tippett et al. (2019) assessed probabilistic forecasts of ENSO phase and amplitude in current NMME prediction systems, finding that regardless of model, forecast format, and skill metric, the boreal spring predictability barrier still explains much of the dependence of skill on target month and forecast lead.

O’Kane et al. (2019) described the development of variants of strongly coupled data assimilation (DA) systems based on ensemble optimal interpolation (EnOI) and ensemble transform Kalman filter (ETKF) methods. The assimilation system was first tested on a small paradigm model of the coupled tropical–extratropical climate system, then implemented for a coupled GCM. They assessed the impact of assimilating ocean observations on the atmospheric state analysis update via the cross-domain error covariances from the coupled-model background ensemble. Using the CSIRO Climate Analysis Forecast Ensemble (CAFE), they also conducted multiyear ENSO prediction experiments with a particular focus on the atmospheric response to tropical ocean perturbations, examining the relationship between ensemble spread, analysis increments, and forecast skill over 2-yr lead times.

Specifically, they employed initial forecast perturbations generated from bred vectors (BVs) (Toth and Kalnay 1997) projecting onto disturbances at and below the thermocline with similar structures. They found that the error growth of these dynamical vectors leads ENSO SST phasing by 6 months. Once expressed at the surface, the dominant mechanism communicating tropical ocean variability to the extratropical atmosphere was found to be via tropical convection modulating the Hadley circulation. They concluded that BVs specific to tropical Pacific thermocline variability were the most effective choices for ensemble initialization and ENSO forecasting. They further assessed forecast skill by comparison of receiver operating characteristic (ROC) curves calculated from a large hindcast dataset (a total of 3696 forecast years), finding that the reduced spread observed at long lead times (out to 465 days), for forecasts initiated from perturbations restricted to the equatorial thermocline, was an indicator of increased skill.

In this paper, we extend the study of O’Kane et al. (2019) by examining the utility of the CAFE forecasts in direct comparison to the state-of-the-art climate forecast systems comprising the NMME. In the CAFE forecasts, initial perturbations specific to the thermocline of the equatorial oceans were applied to a common analyzed atmosphere–ocean state, so it is reasonable to assume that ensemble spread in ENSO forecasts is largely due to these disturbances. The NMME models are initialized with a variety of methods, but all include global perturbations to both ocean and atmospheric initial states. Thus, comparison of the CAFE and NMME models sheds light on the efficacy of targeting initial conditions to regions where the dynamics relevant to the appropriate spatiotemporal variability reside, in this case ENSO variability on seasonal time scales.

In section 2, we describe the NMME and CAFE data used. Section 3 briefly describes the CAFE configuration used to generate the initial forecast perturbations. Section 4 compares the skill of the CAFE forecasts to the NMME in terms of phase (section 4a) and using the random walk sign test (section 4b). Discussion and conclusions are in section 5.

## 2. Data

Throughout, we characterize ENSO state (amplitude, phase, and duration) by the Niño-4 index, that is, SST averaged over the equatorial Pacific region: 5°S–5°N, 160°E–150°W (Barnston et al. 1997). Monthly averages of the observed Niño-4 index for the period January 1982–August 2016 are computed based on HadISST data (Rayner et al. 2003). Forecast monthly averages of the Niño-4 index come from CAFE and the NMME.

The NMME consists of integrations with start dates from a hindcast period (1982–2010) and a real-time period (2011–15), although here we do not discriminate between hindcasts and forecasts, referring to both as forecasts but cognizant of the fact that forecast skill is typically lower than that of hindcasts. The NMME data (Kirtman et al. 2014) have been used in several recent studies of ENSO predictability (DelSole and Tippett 2014, 2016; Tippett et al. 2019; Barnston et al. 2017) and are available from the IRI Data Library (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME). Specifically, the NMME is based on real-time intraseasonal to seasonal to interannual prediction systems [see Table 1 of Kirtman et al. (2014) for the references describing each operational system]. However, apart from setting a minimum lead time (9 months) and ensemble size (11 members), model configurations (i.e., resolution, version, physical parameterizations, initialization strategies, and ensemble generation strategies) are left open to the forecast providers, who use a wide variety of data assimilation and ensemble perturbation strategies. Monthly mean data are provided for the NMME on global grids of SST, 2-m temperature (T2m), and precipitation.

Given the limited mutual span of these datasets (February 2002 to December 2015), we compute anomalies, and thereby apply a bias correction, by removing from each model its (lead-time dependent) ensemble-mean climatology over the common period, estimated using cross-validation.

To facilitate a reasonable comparison to the CAFE forecasts, we consider only the subset of NMME models for which forecast lead times extending up to 12 months have been provided. These models (see Table 1) are the Canadian (CanCM3, CanCM4), GFDL (FLORA, FLORB, AER04), and Center for Ocean–Land–Atmosphere Studies (COLA) models. The initialization method, forecast length, and number of ensemble members vary by model; however, all models are initialized near the start of each month. In the results that follow we label the monthly averages of up to 12-month integrations as having lead times of 0, 1, …, 11 months so that the 0-month lead of a forecast with nominal start date in January is the January average, and so on.

Table 1. Attributes of CAFE and NMME models. Common period considered is 2002–15.

Consider $I$-member ensemble forecasts over a total of $Y$ years, from each of the respective CAFE and NMME models, initialized each month ($m \in$ January, February, …, December) over a given period of years ($y \in 2002, 2003, \ldots, 2015$) such that $N = I \times Y$ is the total number of forecasts initialized from a given month $m$ and lead time $\tau$. The SST Niño-4 bias for a given month $m$ at a given lead time ($\tau \in 1, 2, \ldots, 12$ months) is estimated as […]
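A lead-time-dependent bias estimate of this kind can be sketched as follows. This is a minimal illustration only: it assumes the bias is the mean forecast-minus-observed Niño-4 value over all $N = I \times Y$ forecasts for one start month and lead time, and that cross-validation means withholding the year being corrected. The function names and the sample values are hypothetical.

```python
def lead_dependent_bias(forecasts, observations, exclude_year=None):
    """Mean forecast-minus-observed error for one start month and lead time.

    forecasts:    dict mapping year -> list of ensemble-member values (I per year)
    observations: dict mapping year -> verifying observed value
    exclude_year: year withheld from the estimate (assumed leave-one-year-out)
    """
    errors = [member - observations[year]
              for year, members in forecasts.items() if year != exclude_year
              for member in members]
    return sum(errors) / len(errors)


def debias(forecasts, observations):
    """Subtract the cross-validated bias from every member of every year."""
    return {year: [member - lead_dependent_bias(forecasts, observations, year)
                   for member in members]
            for year, members in forecasts.items()}


# Hypothetical Nino-4 values (deg C): two members per year, three years.
fcst = {2002: [28.0, 28.4], 2003: [29.0, 29.2], 2004: [27.5, 27.9]}
obs = {2002: 28.7, 2003: 29.6, 2004: 28.2}

full_bias = lead_dependent_bias(fcst, obs)   # uses all N = 2 x 3 forecasts
corrected = debias(fcst, obs)                # each year corrected without itself
```

Withholding the target year keeps the correction from using the observation it will later be verified against, which matters when the common period is as short as it is here.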

## 3. CAFE data assimilation and initial forecast perturbations

Ensemble forecasts from CAFE were generated with lead times up to 24 months spanning the period of 2002–15. The CAFE system, including data assimilation, ensemble initialization, the growth of errors, and ENSO skill in terms of ROC curves, has been described in detail by O’Kane et al. (2019). Here, we extend that study to examine the value of targeting thermocline disturbances as initial conditions for long-range ENSO prediction relative to state-of-the-art forecast systems. In the interests of clarity and as background, we now give a brief overview of ensemble initialization in CAFE, describing only the details pertinent to the current discussion, and refer the reader to O’Kane et al. (2019) for additional information. Note that the CAFE forecast data used here correspond to the F1 forecast data from O’Kane et al. (2019).

### a. Data assimilation

[…] $k$-ensemble forecast anomalies $\mathbf{z}_i$, defined as […] $n$-dimensional in model space and where $i = 1, 2, \ldots, k$, run over the entire ensemble. In CAFE, the […] temperature ($T$) and salinity ($S$) data. SSHA is derived from the Radar Altimeter Database System (RADS) altimetry (http://rads.tudelft.nl/rads/rads.shtml) and in situ $T$ and $S$ observations from Argo, expendable bathythermograph (XBT), and conductivity–temperature–depth (CTD) data, as well as TAO/TRITON (Pacific), PIRATA (Atlantic), and RAMA (Indian) ocean moorings from the Global Tropical Moored Buoy Array and from the World Meteorological Organization Global Telecommunication System (WMO GTS) (see http://www.wmo.int/pages/prog/www/TEM/GTS/index_en.html).

### b. Initial forecast perturbations

[…] $\omega'(t) = \omega(t) + \Delta\omega(t)$, where $\omega(t)$ is the control state under the full nonlinear governing equations and $\Delta\omega$ is the perturbation of the model state from the unperturbed control. The perturbations themselves are rescaled to a given size $\varepsilon$ periodically at a time interval $\Delta T$ as follows. The difference between control and perturbed trajectories $\Delta\omega(t + \Delta t) = \omega'(t + \Delta t) - \omega(t + \Delta t)$ is computed at times $\Delta t = n\Delta T$ for $n \in 1, \ldots, N$, whereupon the perturbation is rescaled and the perturbed system redefined as […] $(n + 1)\Delta T$. The BV corresponds to the (finite) perturbation $\Delta\omega(t)$ constructed at time $t$ via a straightforward rescaling of the forecast perturbation by a uniform factor […] $\Delta T$ is the rescaling interval and $\Delta\omega(t)$ is the BV at time $t$.

More generally we may define the relative amplification factor in terms of the vector of gridpoint values of the BV of any climate variable field as […] $\omega(\Delta t, t)$ initiated at time $t$ and evolved to time $\Delta t + t$. We take the $L_2$ norm as the root mean square of the vector by […] $S$ based on the $L_2$ norm for temperature at each level within the isosurface as […] $i$ refers to level, $\sigma_i$ is the standard deviation of temperature at a given level $i$ calculated from the control simulation as described above. This norm is applied to all ocean prognostic state variables, and is a simplification of the multivariate approach of Cai et al. (2003).
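The breeding cycle described above can be illustrated with a toy system. The sketch below is not the CAFE implementation: it assumes the Lorenz (1963) equations as a stand-in for the coupled model, a plain root-mean-square norm over all variables rather than the isosurface-restricted, variance-weighted norm, and hypothetical values for the perturbation size and rescaling interval.

```python
import math


def rms(vector):
    """Root-mean-square norm of a vector of gridpoint values."""
    return math.sqrt(sum(x * x for x in vector) / len(vector))


def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz (1963) toy model (assumed stand-in)."""
    x, y, z = state
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))


def integrate(state, n_steps):
    """Advance the nonlinear model over one rescaling interval."""
    for _ in range(n_steps):
        state = lorenz63_step(state)
    return state


def breed(control, perturbation, eps, steps_per_cycle, n_cycles):
    """Run breeding cycles; return the final bred vector and amplification factors."""
    scale = eps / rms(perturbation)
    bv = tuple(scale * p for p in perturbation)      # initial perturbation of size eps
    growth = []
    for _ in range(n_cycles):
        perturbed = tuple(c + b for c, b in zip(control, bv))
        control = integrate(control, steps_per_cycle)       # control trajectory
        perturbed = integrate(perturbed, steps_per_cycle)   # perturbed trajectory
        diff = tuple(p - c for p, c in zip(perturbed, control))
        amplitude = rms(diff)
        growth.append(amplitude / eps)               # amplification over one interval
        bv = tuple(eps * d / amplitude for d in diff)  # uniform rescaling to size eps
    return bv, growth


bv, growth = breed(control=(1.0, 1.0, 20.0), perturbation=(1.0, 0.0, 0.0),
                   eps=0.1, steps_per_cycle=8, n_cycles=50)
```

Because both trajectories are advanced with the full nonlinear model and only the difference is rescaled, the bred vector retains the structure of the fastest-growing finite-amplitude disturbance at the chosen size.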

The BVs are generated each month as the renormalized differences between the unperturbed (control) and the perturbed forecasts at 1-month lead time. The new ensemble of 10 BVs is then added to the analyzed ocean state. The 11-member ensemble forecasts are initialized each month from the 10 BVs and the analyzed (control) states. The atmosphere is initialized to a common state that has been constrained via cross-covariances with the ocean observations. As stated earlier, the rescaling amplitudes are based on the 1–2-month in-band variance isosurface specific to the equatorial thermocline. BVs within the isosurface are the only perturbations added to the analysis to generate the ensemble, and hence all disturbances originate from the tropical thermocline.

To briefly illustrate how BV perturbations initially at the thermocline evolve and finally contribute to the improved ocean surface forecast, we calculate the ensemble average monthly mean BV perturbations, that is, ensemble averages of 10 monthly mean ocean temperature forecasts differenced with respect to the control (unperturbed) forecast between ±15° latitude over the equatorial Pacific. In Fig. 1, we show the evolution of an initial disturbance located at about 95 m, and track the growth of that disturbance through the water column (depths of 95, 85, 55, 25, and 5 m) over lead times of 1, 3, 6, 7, 9, and 10 months from an initial start date of March 2015. Comparison with the near-surface disturbances in Fig. 1 shows no coherent response until lead-time month 7, when the bred vectors express at the surface. Thus, a well-chosen initial disturbance can contribute information specific to the thermocline that expresses 6–7 months later, potentially adding important information about growing subsurface instabilities to initial conditions for forecasts initiated during the boreal spring, as demonstrated here during the lead-up to the 2016 El Niño.

The CAFE forecasts were developed to better understand the role of the Pacific thermocline in ENSO predictability and the extratropical atmospheric response to equatorial ocean disturbances. As we are generally interested in ENSO prediction at lead times from seasonal to interannual, we assume that predictability resides in the ocean and that atmospheric initial conditions are subdominant. O’Kane et al. (2019, their Fig. 8) showed that the maximum lag correlation between the Multivariate ENSO Index (MEI) (Wolter and Timlin 2011) and the growth rate of subsurface temperature disturbances in the equatorial Pacific occurred at approximately 150-m depth with a 6-month lag. Thus, thermocline disturbances lead SST by 6 months, indicating a dominant role for the subsurface dynamics at lead times beyond a season. Forecast ensemble plumes of the raw CAFE multiyear ENSO forecasts (O’Kane et al. 2019, their Fig. 14) initialized from January 2007 found that the member forecasts overshot the observed 2008 La Niña as a result of the model equatorial Pacific cold tongue bias reestablishing coincident with the onset of the spring predictability barrier. However, relative to other types of initial perturbations, CAFE forecasts initialized with perturbations specific to the growing disturbances local to the tropical Pacific thermocline evolved more coherently with reduced error growth and consequently reduced ensemble spread and with improved ENSO predictability. More generally, we propose that using an ensemble of states differing only by perturbations specifically tuned to tropical coupled instabilities, and hence relevant to ENSO, would be effective as forecast initial perturbations, and particularly so where ensemble sizes are limited. Similar approaches have been considered by Yang et al. (2006) and Frederiksen et al. (2010).

## 4. Comparison to NMME

In the subsequent calculations we examine the forecast skill of ENSO, in terms of the Niño-4 index, at given lead times from 0 to 12 months. As a reference, in Fig. 2 we show the ensemble average forecast Niño-4 index initialized each month for the CAFE and respective NMME models at lead times of 0, 3, 6, and 11 months. Apart from the systematic increase in spread as lead time increases, perhaps the most noticeable point of interest is that each model’s prediction of both the maximum and phase of the 2009/10 El Niño and the minimum and phase of the subsequent 2010/11 La Niña is quite good even at lead times of 6 months. O’Kane et al. (2019, their Fig. 15) showed receiver operating characteristic (ROC) curves calculated for the Niño-4 index comparing 11-member BV ensemble forecasts started each month at lead times out to 2 years over the period 2003 through June 2017. The ROC curve (i.e., both hit rate and false alarm rate) is calculated for prediction of an occurrence (i.e., yes or no) of events where the Niño-4 anomaly exceeds 1°C. These results provide a statistical analysis of the accuracy of the CAFE forecasts with respect to the observed Niño-4 index, with the CAFE forecasts better than random at lead times up to 465 days.

### a. Anomaly correlation coefficient

Here we use the anomaly correlation coefficient (ACC) to verify ENSO indices based on spatially averaged SST fields (Pearson 1895; Jolliffe and Stephenson 2011). Specifically, we apply ACC to SST in the Niño-4 region considering the correlation of anomalies of forecasts with verifying reference values from the HadISST dataset.

#### 1) Metric definition

[…] $N$ is the sample size and […] $f_i$ and $o_i$ respectively. More generally, $w_i$ is the weighting coefficient (here equal to 1) such that […] $F_i$, $O_i$, and $C_i = C$ represent individual forecast, verifying (or observed), and reference values (i.e., climatological samples), respectively. When the variation pattern of the forecast anomalies coincides exactly with that of the verifying data, the ACC equals 1, or alternatively −1 where the pattern is completely reversed. The ACC measures the correspondence or phase difference between forecast and observations, subtracting out the climatological mean at each point $C_i$, where sample mean values are subtracted for the centered ACC. The anomaly correlation is frequently used to verify output from numerical weather prediction (NWP) models. Importantly, the ACC is not sensitive to forecast bias, so it follows that a good anomaly correlation does not necessarily guarantee accurate forecasts, hence the common practice of employing a large hindcast dataset to debias forecast data. Here, we focus on skill in predicting ENSO phase verified by the metric ACC, recognizing that verification will be subject to the usual limitations of finite sample sizes, even in the period studied here, due to the relatively few actual El Niño and La Niña events.
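The properties just described can be checked with a short sketch of a weighted, centered anomaly correlation. It assumes the common textbook form, in which anomalies $f_i = F_i - C_i$ and $o_i = O_i - C_i$ have their weighted sample means removed before correlating; the function name and the sample values are hypothetical.

```python
import math


def acc(forecast, observed, climatology, weights=None):
    """Centered anomaly correlation coefficient (assumed standard weighted form)."""
    w = weights if weights is not None else [1.0] * len(forecast)
    f = [F - C for F, C in zip(forecast, climatology)]   # forecast anomalies
    o = [O - C for O, C in zip(observed, climatology)]   # observed anomalies
    w_sum = sum(w)
    f_bar = sum(wi * fi for wi, fi in zip(w, f)) / w_sum
    o_bar = sum(wi * oi for wi, oi in zip(w, o)) / w_sum
    cov = sum(wi * (fi - f_bar) * (oi - o_bar) for wi, fi, oi in zip(w, f, o))
    var_f = sum(wi * (fi - f_bar) ** 2 for wi, fi in zip(w, f))
    var_o = sum(wi * (oi - o_bar) ** 2 for wi, oi in zip(w, o))
    return cov / math.sqrt(var_f * var_o)


# Hypothetical Nino-4 values (deg C).
observed = [28.0, 28.5, 29.0, 28.2]
climatology = [28.4] * 4
biased = [o + 0.7 for o in observed]                     # right pattern, warm bias
reversed_pattern = [2 * c - o for o, c in zip(observed, climatology)]

acc_biased = acc(biased, observed, climatology)          # close to +1
acc_reversed = acc(reversed_pattern, observed, climatology)  # close to -1
```

The uniformly biased forecast still scores an ACC of essentially 1, which is why a high anomaly correlation on its own does not guarantee an accurate forecast.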

Statistical significance levels are calculated to test the significance of positive correlations and differences. Here we use the fraction of negative values in the bootstrapped distribution as a *p* value, and compare this to the chosen significance level, here the 95th percentile or the 5% level. Nonparametric bootstrapped distributions are constructed as a function of lead time and initial month from randomly sampled (without replacement) initial years and ensembles used to build the mean. Here we use 100 bootstrap resamplings. A similar approach is described in detail in Goddard et al. (2013).
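A minimal sketch of this resampling test is given below. Several details are assumptions rather than the procedure actually used: only initial years are resampled (ensemble members are not), each resample draws a subset of roughly two-thirds of the years without replacement, and the statistic is a plain Pearson correlation. The function names, subset size, and sample values are hypothetical.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def resampled_p_value(forecast, observed, n_resamples=100, subset_size=None, seed=0):
    """Fraction of resampled correlations that are negative."""
    rng = random.Random(seed)
    n = len(forecast)
    k = subset_size if subset_size is not None else max(3, (2 * n) // 3)
    negatives = 0
    for _ in range(n_resamples):
        years = rng.sample(range(n), k)      # draw years without replacement
        r = correlation([forecast[i] for i in years],
                        [observed[i] for i in years])
        if r < 0.0:
            negatives += 1
    return negatives / n_resamples


# Hypothetical anomalies for 14 initial years at one start month and lead time.
obs_anom = [0.3, -0.8, 1.2, 0.1, -0.4, 0.9, -1.1, 0.5, 0.0, -0.2, 0.7, -0.6, 1.0, 0.2]
good_fcst = [0.8 * o for o in obs_anom]
bad_fcst = [-o for o in obs_anom]

p_good = resampled_p_value(good_fcst, obs_anom)   # no negative correlations
p_bad = resampled_p_value(bad_fcst, obs_anom)     # every correlation negative
```

A positive correlation would be judged significant at the 5% level when the returned fraction falls below 0.05.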

#### 2) Results

We now examine predictability specifically in terms of ENSO phase, as determined by ACC in CAFE relative to the NMME forecasts applying all caveats regarding model biases. We stress that the CAFE ensemble forecasts focus nearly exclusively on the predictability arising from subsurface equatorial ocean dynamics and that only the large scales of the atmosphere have been weakly constrained to the current climate through cross-domain correlations with ocean observations. As the CAFE forecasts were not initialized with the observed synoptic features present at the time of the forecast, one does not expect them to be more skillful (relative to NMME) over the first few forecast months when knowledge of the particular state of the synoptic and faster time scale atmospheric processes determine a significant fraction of the potential predictability. Rather, as shown by O’Kane et al. (2019), it is when the predictability associated with the thermocline expresses at the surface (at lead times of 6 months and longer) that a signal (i.e., enhanced predictability) emerges.

We first calculate the anomaly correlations of the CAFE and NMME forecasts in relation to the observed Niño-4 index (Figs. 3a,d,f,h,j,l,n). Here the black dots indicate significant correlations at the 95th-percentile level. The black lines indicate ACC values for forecasts initiated in March (1-month lead) through December (10-month lead). The boreal spring predictability barrier is clearly evident in the HadISST lagged autocorrelation (Fig. 3b). This period of reduced potential predictability is in contrast to the boreal autumn where significant positive values (*r* > 0.9) in the autocorrelation extend out to 12-month lead (positive lag), indicating that the potential predictability is at a maximum. Unsurprisingly, all NMME models exhibit the lowest ACC values at lead times greater than 6 months for those forecasts with target months in between June and December corresponding to the months and lags where the HadISST autocorrelation is low. The CAFE ACC indicates best skill for forecasts initialized during the boreal spring, summer, and autumn and in particular beyond 12 months (not shown) for forecasts initialized in March–April.

The CAFE model configuration displays a general “cold tongue” bias in SST whereby the major region of variability in the equatorial Pacific is displaced to the west of the observed maximum. This bias impacts the modeled ENSO phase locking and is the major source of error evident in CAFE forecasts initiated in December at lead times of 5–8 months (i.e., corresponding to target months June–August) (Fig. 3a). The CanCM3, FLORA, FLORB, and COLA models are most impacted by the boreal spring barrier at longer lead than 6 months. At lead times less than 6 months, CanCM4 is the best performing model in terms of ACC.

To quantify the differences in skill between the CAFE and NMME forecasts, we subtract the ACC values of each NMME model from those of the CAFE forecasts; see Figs. 3c,e,g,i,k,m, where red (blue) indicates larger (smaller) ACC values for CAFE relative to NMME. The CAFE forecasts are less skillful than the NMME forecasts for lead times shorter than 6 months and for forecasts initiated in the boreal winter. Again, the solid black lines indicate skill of forecasts initiated in March out to 10-month lead (i.e., December). Forecast skill for the target months (read in the horizontal direction) of May, June, July, and August generally degrades at lead times beyond 6 months, as reflected in the anomaly correlation with HadISST (Fig. 3a), and arises due to the aforementioned model bias [also discussed in O’Kane et al. (2019)]. That said, relative to the CanCM3, FLORA, FLORB, and COLA (Figs. 3c,i,k,m) forecasts, the CAFE forecasts are generally more skillful for target months beyond 6–8-month lead. As CAFE employs a variant of the GFDL CM2.1 coupled general circulation model, the differences relative to the GFDL AER04 configuration (Fig. 3g) are presumably less pronounced than those relative to the FLORA and FLORB configurations (Figs. 3i,k). Nevertheless, CAFE exhibits improvements in skill over the GFDL configurations at longer lead times, in common with comparisons to the other NMME models with the exception of CanCM4. Interestingly, we found a reasonable correlation between increased CAFE skill relative to the individual NMME models and regions of reduced HadISST autocorrelation values.

The general increase in ACC with respect to the individual models is reflected in the difference between the multimodel ensemble-mean ACC with and without inclusion of the CAFE forecasts (Fig. 4). Here we calculate the ACC for five NMME models plus the CAFE forecasts and take the difference with the ensemble-mean ACC calculated from the complete (six-member) NMME. This is repeated excluding each individual NMME model in turn (Figs. 4a–f). Last, we take the average over all six difference calculations to produce an average difference plot (Fig. 4g). This process gives an estimate of the impact on ACC when the CAFE forecasts are included relative to the NMME without prejudice toward any particular model. Here we note that even for lead times shorter than 6 months, with the exception of forecasts initiated in December, the combined multimodel ensemble is positively impacted by inclusion of the CAFE forecasts, again with the improvements in target months and lags occurring where the corresponding HadISST autocorrelation values are low. These results are robust and consistent across all model combinations.
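The substitution procedure can be sketched for a single start month and lead time as follows. The sketch assumes each model is represented by its ensemble-mean anomaly series, that the multimodel mean weights models equally, and that a plain Pearson correlation stands in for the ACC; the function names and the sample values are hypothetical.

```python
import math


def corr(x, y):
    """Pearson correlation, standing in for the ACC of anomaly series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def multimodel_mean(series_list):
    """Equal-weight mean across models at each verification time."""
    return [sum(values) / len(values) for values in zip(*series_list)]


def cafe_impact(nmme, cafe, observed):
    """ACC change when CAFE replaces each NMME model in turn, and the average."""
    baseline = corr(multimodel_mean(list(nmme.values())), observed)
    diffs = {}
    for left_out in nmme:
        members = [s for name, s in nmme.items() if name != left_out] + [cafe]
        diffs[left_out] = corr(multimodel_mean(members), observed) - baseline
    return diffs, sum(diffs.values()) / len(diffs)


# Hypothetical anomalies: two poor "NMME" models and a well-correlated "CAFE".
observed = [0.2, -0.5, 1.0, 0.3]
nmme = {"model_1": [-o for o in observed], "model_2": [-o for o in observed]}
cafe = [3.0 * o for o in observed]

diffs, average_diff = cafe_impact(nmme, cafe, observed)
```

Averaging the differences over every left-out model gives one number per start month and lead time that does not depend on which NMME model was replaced.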

A prominent feature emerges from this analysis, namely, that the NMME models are in general most skillful at 3–6-month lead time for target months between October and December, whereas the CAFE forecasts, relative to the NMME (1982–2015), are generally more skillful at lead times longer than 6–8 months and in particular target months corresponding to the boreal spring barrier. These results indicate the importance of atmospheric initial conditions on ENSO predictability at shorter lead times of a few months and the more important role of the emergent signal from thermocline disturbances at longer lead times beyond two seasons into the future.

### b. Sign test

We further apply a procedure proposed by DelSole and Tippett (2016) based on random walks and formally equivalent to the sign test, here applied to the Niño-4 index. The method is independent of distributional assumptions about the forecast errors and, while assuming that individual forecasts are independent, can provide useful information even where serial correlations in time are present. Additionally, the method requires only relatively few years of data to detect significant differences in the skill of models where biases are known a priori.

#### 1) Metric definition

[…] $N$ forecasts $A$ and $B$, where $K$ denotes the number of times $A$ is more skillful than $B$. Assume each time step is an independent Bernoulli (random) trial (i.e., there is no correlation between consecutive events). The underlying null hypothesis is that forecast $A$ is equally likely to be more or less skillful than $B$. In this case $K$ should follow a binomial distribution with $p = 1/2$; thus, the $p$ value is given by […] $K_0$ and the factor 2 accounts for the test being two-tailed (i.e., we do not know a priori which forecast is best). The null hypothesis is rejected if $p_{\text{value}}$ falls below a prescribed significance level $\alpha$.

[…] $K_0$ such that $p_{\text{value}} \geq \alpha$ for $\alpha = 5\%$. If $A$ is more skillful than $B$, a positive step is taken. If $B$ is more skillful than $A$, then a negative step is taken. The case $A = B$ is assumed never to occur and if it does occur, then no steps are taken. Thus, there are $K$ steps in the positive direction and $N - K$ in the negative direction and […]
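A minimal sketch of the test follows. It assumes the usual two-tailed binomial form, in which the $p$ value is twice the probability of a count at least as extreme as $K_0$ under a binomial distribution with $p = 1/2$ (capped at 1), and it uses squared error as the measure of skill at each step. The function names and sample values are hypothetical.

```python
import math


def sign_test_p_value(k0, n):
    """Two-tailed p value for k0 wins in n trials under Binomial(n, 1/2)."""
    tail = min(k0, n - k0)                   # size of the more extreme tail
    p = 2.0 * sum(math.comb(n, k) for k in range(tail + 1)) / 2 ** n
    return min(1.0, p)


def random_walk(sq_err_a, sq_err_b):
    """Cumulative count: +1 when A has the smaller squared error, -1 when B does."""
    position, path = 0, []
    for err_a, err_b in zip(sq_err_a, sq_err_b):
        if err_a < err_b:
            position += 1
        elif err_b < err_a:
            position -= 1                    # ties leave the walk unchanged
        path.append(position)
    return path


# Hypothetical squared errors for two forecasts over four verification times.
path = random_walk([1.0, 2.0, 3.0, 1.0], [2.0, 1.0, 3.0, 4.0])

p_all_wins = sign_test_p_value(10, 10)       # A better every time
p_even = sign_test_p_value(5, 10)            # no evidence either way
```

With no ties, the final position of the walk is $K - (N - K) = 2K - N$, so a walk that stays near zero corresponds to forecasts that cannot be distinguished by this test.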

#### 2) Results

We now apply the random walk sign test as a general method for comparing the skill of two predictions; here the evaluation of mean square error partially serves to evaluate the amplitude of ENSO. After applying bias correction, as described in section 2, the sign test random walk is run from February 2002 through December 2015. We consider skill in monthly values of the Niño-4 index; that is, the count increases by 1 when the squared error of the CAFE (model $B$) Niño-4 index is larger than that for a given individual NMME model $A_i$, $i \in \{\text{NMME}(1), \text{NMME}(2), \ldots, \text{NMME}(6)\}$ ($A_i$ is defined in Table 1), and decreases by 1 otherwise. Throughout, the assumption is $A_i \neq B$ for all $i$. The count is accumulated forward in time for each model separately, over all initial months in a given season and over the years between 2002 and 2015 (for a fixed lead time), thereby tracing out a random walk. In Fig. 5, the white region indicates the range of counts that would be obtained 95% of the time under independent random (Bernoulli) trials for $p = 0.5$ (i.e., models in this range are statistically indistinguishable). A random walk extending into the blue (tan) shaded area indicates that CAFE forecasts are significantly less (more) skillful more often than expected for independent random trials; that is, CAFE is further from (closer to) the observation. Changes in the average slope of the random walk can be an indication of a systematic change in skill. Cognizant of the seasonal dependence of ENSO forecasts, we partition the count by season.

In Fig. 5 and the text that follows, we refer to the respective seasons using the first letter of each month (i.e., DJF refers to the boreal winter months of December–February and similarly for the other seasons). Despite being explicitly constructed to project onto ENSO predictability at lead times beyond 6 months, at 3-month lead (Fig. 5, left column) the CAFE forecasts are statistically indistinguishable from the majority of NMME forecasts initialized in JJA and SON; less skillful than COLA, CanCM3, and CanCM4 for DJF; and only less skillful than COLA for MAM forecasts. At 6-month lead (Fig. 5, center column) for DJF forecasts, CAFE is statistically less skillful than all NMME models with the exception of CanCM4 and AER04, a result consistent with the poor phase locking evident in Fig. 3a for CAFE forecasts initialized in December at 6-month lead time. In the remaining seasons (MAM, JJA, and SON), there is little to distinguish CAFE and NMME forecasts at 6-month lead. At 11-month lead (Fig. 5, right column) for DJF, there is evidence that CAFE is slightly more skillful than COLA and CanCM3; however, skill, as measured by the sign test, is comparable across models.

## 5. Discussion and conclusions

We have shown that, during the boreal spring, the difficulties associated with detecting the relevant and often weak SST anomalies associated with specific ENSO events may be mitigated to some extent by targeting disturbances about the tropical Pacific thermocline. In O’Kane et al. (2019), it was demonstrated that the variability associated with thermocline disturbances typically leads the SST variability by about 6 months, thereby providing a mechanism for extended predictability. They focused on the methods and mechanisms by which ensemble forecasts may be initiated using nonlinearly modified dynamical vectors to span the local low-dimensional subspace where variance, partitioned into a specific spatiotemporal band relevant to ENSO, resides.

In this follow-up study, we have compared multiyear ensemble ENSO forecasts from the CAFE system to ensemble forecasts from state-of-the-art dynamical coupled models in the NMME project. Using ACC as a standard metric of forecast skill in terms of ENSO phase largely removes the influence of the biases specific to each of the respective models. We find that, relative to the respective NMME forecasts, the CAFE forecasts display increased ACC values at lead times greater than 6 months, with the largest increases for target months when predictability is most strongly limited by the boreal spring barrier. Comparison of ensemble-mean ACC values with and without the CAFE forecasts clearly shows the utility of augmenting current initialization methods with initial perturbations based on targeting instabilities specific to the tropical Pacific thermocline for ENSO prediction beyond a season. These results show that the inclusion of CAFE in the NMME increases the ACC for most lead times and target months and that CAFE brings rather independent information to the NMME.

Next, the random walk sign test was used to measure forecast skill. Although less skillful at 3-month lead than a seasonally dependent subset of NMME models, the relative skill of the CAFE forecasts progressively increases against all NMME models with lead time such that at 11-month lead CAFE is at least as skillful as any particular NMME model. This is an important result given that the CAFE ensemble forecasts were initialized with only a highly targeted initial perturbation at the equatorial Pacific thermocline. Previous calculations of receiver operating characteristic (ROC) curves for Niño-4 forecasts, when calculated over all start dates and lead times out to 24 months (Fig. 15 of O’Kane et al. 2019), revealed skill out beyond 400 days. In those calculations, the ROC curve (i.e., both the hit rate and the false alarm rate) was calculated for prediction of an occurrence (i.e., yes or no) of Niño-4 events where the Niño-4 anomaly exceeds 1°C.
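The hit rate and false alarm rate that make up a ROC curve for a yes/no event such as "Niño-4 anomaly exceeds 1°C" can be sketched as follows. This is a generic illustration and not the calculation of O’Kane et al. (2019): the function name, the use of the ensemble fraction as the forecast probability, and the sample values are all assumptions.

```python
import numpy as np


def roc_points(ens_forecasts, observed, event_threshold=1.0,
               prob_thresholds=(0.0, 0.3, 0.5, 1.0)):
    """Hit rate and false alarm rate per probability threshold (sketch).

    ens_forecasts : array (n_forecasts, n_members) of forecast anomalies.
    observed      : array (n_forecasts,) of verifying anomalies.
    """
    ens = np.asarray(ens_forecasts, dtype=float)
    obs = np.asarray(observed, dtype=float)
    event = obs > event_threshold                  # did the event occur?
    prob = (ens > event_threshold).mean(axis=1)    # fraction of members saying yes
    n_event, n_nonevent = event.sum(), (~event).sum()
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")
    hit_rates, false_alarm_rates = [], []
    for p in prob_thresholds:
        warn = prob >= p                           # issue a "yes" forecast
        hit_rates.append(np.sum(warn & event) / n_event)
        false_alarm_rates.append(np.sum(warn & ~event) / n_nonevent)
    return np.array(false_alarm_rates), np.array(hit_rates)


# Made-up anomalies (degrees Celsius): four forecasts, three members each.
ens = [[1.2, 1.5, 0.8],
       [0.1, 0.3, -0.2],
       [1.1, 0.4, 0.6],
       [0.2, 1.3, 1.4]]
obs = [1.3, 0.0, 0.7, 1.1]
far, hr = roc_points(ens, obs)
```

Plotting `hr` against `far` over the range of probability thresholds traces the ROC curve; points above the diagonal indicate skill relative to random guessing.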

Taken as a whole, our results suggest that augmenting current initialization methods with initial perturbations targeting instabilities specific to the tropical Pacific thermocline generally improves ENSO prediction beyond 6-month lead time. More generally, it is reasonable to assume that the predictability of specific climate teleconnections may be targeted by judicious projection of initial ensemble perturbations onto the relevant instabilities responsible for determining variability at the spatiotemporal scales of interest.

*Acknowledgments.* The authors were supported by the Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO) Decadal Climate Forecasting Project (https://research.csiro.au/dfp).

## REFERENCES

Balmaseda, M. A., and D. Anderson, 2009: Impact of initialization strategies and observations on seasonal forecast skill. *Geophys. Res. Lett.*, **36**, L01701, https://doi.org/10.1029/2008GL035561.

Barnston, A. G., M. Chelliah, and S. B. Goldenberg, 1997: Documentation of a highly ENSO-related SST region in the equatorial Pacific: Research note. *Atmos.–Ocean*, **35**, 367–383, https://doi.org/10.1080/07055900.1997.9649597.

Barnston, A. G., M. K. Tippett, M. L. L’Heureux, S. Li, and D. G. DeWitt, 2012: Skill of real-time seasonal ENSO model predictions during 2001–11. Is our capacity increasing? *Bull. Amer. Meteor. Soc.*, **93**, 631–651, https://doi.org/10.1175/BAMS-D-11-00111.1.

Barnston, A. G., M. K. Tippett, M. Ranganathan, and M. L. L’Heureux, 2017: Deterministic skill of ENSO predictions from the North American multimodel ensemble. *Climate Dyn.*, **53**, 7215–7234, https://doi.org/10.1007/S00382-017-3603-3.

Cai, M., E. Kalnay, and Z. Toth, 2003: Bred vectors of the Zebiak–Cane model and their potential application to ENSO prediction. *J. Climate*, **16**, 40–56, https://doi.org/10.1175/1520-0442(2003)016<0040:BVOTZC>2.0.CO;2.

Cooley, J. W., P. A. W. Lewis, and P. D. Welch, 1969: The fast Fourier transform and its applications. *IEEE Trans. Educ.*, **12**, 27–34, https://doi.org/10.1109/TE.1969.4320436.

DelSole, T., and M. K. Tippett, 2014: Comparing forecast skill. *Mon. Wea. Rev.*, **142**, 4658–4678, https://doi.org/10.1175/MWR-D-14-00045.1.

DelSole, T., and M. K. Tippett, 2016: Forecast comparison based on random walks. *Mon. Wea. Rev.*, **144**, 615–626, https://doi.org/10.1175/MWR-D-15-0218.1.

Evensen, G., 2003: The ensemble Kalman filter: Theoretical formulation and practical implementation. *Ocean Dyn.*, **53**, 343–367, https://doi.org/10.1007/s10236-003-0036-9.

Flügel, M., and P. Chang, 1998: Does the predictability of ENSO depend on the seasonal cycle? *J. Atmos. Sci.*, **55**, 3230–3243, https://doi.org/10.1175/1520-0469(1998)055<3230:DTPOED>2.0.CO;2.

Frederiksen, J. S., C. S. Frederiksen, and S. L. Osbrough, 2010: Seasonal ensemble prediction with a coupled ocean–atmosphere model. *Aust. Meteor. Ocean J.*, **59**, 53–66, https://doi.org/10.22499/2.5901.007.

Goddard, L., and Coauthors, 2013: A verification framework for interannual-to-decadal predictions experiments. *Climate Dyn.*, **40**, 245–272, https://doi.org/10.1007/s00382-012-1481-2.

Jin, E. K., and Coauthors, 2008: Current status of ENSO prediction skill in coupled ocean–atmosphere models. *Climate Dyn.*, **31**, 647–664, https://doi.org/10.1007/s00382-008-0397-3.

Jolliffe, I. T., and D. B. Stephenson, 2011: *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*. 2nd ed. Wiley, 292 pp.

Kirtman, B. P., and D. Min, 2009: Multimodel ensemble ENSO prediction with CCSM and CFS. *Mon. Wea. Rev.*, **137**, 2908–2930, https://doi.org/10.1175/2009MWR2672.1.

Kirtman, B. P., and Coauthors, 2014: The North American Multi-Model ensemble (NMME): Phase-1 seasonal to interannual prediction; Phase-2 toward developing intra-seasonal prediction. *Bull. Amer. Meteor. Soc.*, **95**, 585–601, https://doi.org/10.1175/BAMS-D-12-00050.1.

Lai, A. W.-C., M. Herzog, and H.-F. Graf, 2018: ENSO forecasts near the spring predictability barrier and possible reasons for the recently reduced predictability. *J. Climate*, **31**, 815–838, https://doi.org/10.1175/JCLI-D-17-0180.1.

O’Kane, T. J., P. R. Oke, and P. A. Sandery, 2011: Predicting the East Australian current. *Ocean Modell.*, **38**, 251–266, https://doi.org/10.1016/j.ocemod.2011.04.003.

O’Kane, T. J., and Coauthors, 2019: Coupled data assimilation and ensemble initialization with application to multiyear ENSO prediction. *J. Climate*, **32**, 997–1024, https://doi.org/10.1175/JCLI-D-18-0189.1.

Pearson, K., 1895: Notes on regression and inheritance in the case of two parents. *Proc. Roy. Soc. London*, **58**, 240–242, https://doi.org/10.1098/rspl.1895.0041.

Philander, S. G. H., T. Yamagata, and R. C. Pacanowski, 1984: Unstable air–sea interactions in the tropics. *J. Atmos. Sci.*, **41**, 604–613, https://doi.org/10.1175/1520-0469(1984)041<0604:UASIIT>2.0.CO;2.

Rayner, N. A., D. E. Parker, E. B. Horton, C. K. Folland, L. V. Alexander, D. P. Rowell, E. C. Kent, and A. Kaplan, 2003: Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. *J. Geophys. Res.*, **108**, 4407, https://doi.org/10.1029/2002JD002670.

Saha, S., and Coauthors, 2006: The NCEP Climate Forecast System. *J. Climate*, **19**, 3483–3517, https://doi.org/10.1175/JCLI3812.1.

Tippett, M. K., A. G. Barnston, and S. Li, 2012: Performance of recent multimodel ENSO forecasts. *J. Appl. Meteor. Climatol.*, **51**, 637–654, https://doi.org/10.1175/JAMC-D-11-093.1.

Tippett, M. K., M. Ranganathan, M. L. L’Heureux, A. G. Barnston, and T. DelSole, 2019: Assessing probabilistic predictions of ENSO phase and intensity from the North American Multimodel Ensemble. *Climate Dyn.*, **53**, 7497–7518, https://doi.org/10.1007/S00382-017-3721-Y.

Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. *Mon. Wea. Rev.*, **125**, 3297–3319, https://doi.org/10.1175/1520-0493(1997)125<3297:EFANAT>2.0.CO;2.

Wolter, K., and M. Timlin, 2011: El Niño/Southern Oscillation behaviour since 1871 as diagnosed in an extended multivariate ENSO index (MEI.ext). *Int. J. Climatol.*, **31**, 1074–1087, https://doi.org/10.1002/joc.2336.

Yang, S. C., M. Cai, E. Kalnay, M. Rienecker, G. Yuan, and Z. Toth, 2006: ENSO bred vectors in coupled ocean–atmosphere general circulation models. *J. Climate*, **19**, 1422–1436, https://doi.org/10.1175/JCLI3696.1.