Real-time model predictions of ENSO conditions during the 2002–11 period are evaluated and compared to skill levels documented in studies of the 1990s. ENSO conditions are represented by the Niño-3.4 SST index in the east-central tropical Pacific. The skills of 20 prediction models (12 dynamical, 8 statistical) are examined. Results indicate skills somewhat lower than those found for the less advanced models of the 1980s and 1990s. Using hindcasts spanning 1981–2011, this finding is explained by the relatively greater predictive challenge posed by the 2002–11 period and suggests that decadal variations in the character of ENSO variability are a greater skill-determining factor than the steady but gradual trend toward improved ENSO prediction science and models. After adjusting for the varying difficulty level, the skills of 2002–11 are slightly higher than those of earlier decades. Unlike earlier results, the average skill of dynamical models slightly, but statistically significantly, exceeds that of statistical models for start times just before the middle of the year when prediction has proven most difficult. The greater skill of dynamical models is largely attributable to the subset of dynamical models with the most advanced, high-resolution, fully coupled ocean–atmosphere prediction systems using sophisticated data assimilation systems and large ensembles. This finding suggests that additional advances in skill remain likely, with the expected implementation of better physics, numeric and assimilation schemes, finer resolution, and larger ensemble sizes.

The low predictability of the past decade masked a gradual improvement of ENSO predictions, with skill of dynamical models now exceeding that of statistical models.

During the last two to three decades, one might reasonably expect our ability to predict warm and cold episodes of the El Niño–Southern Oscillation (ENSO) at short and intermediate lead times to have gradually improved. Such improvement would be attributable to improved observing and analysis/assimilation systems, improved physical parameterizations, higher spatial resolution, and better understanding of the tropical oceanic and atmospheric processes underlying the ENSO phenomenon (e.g., Guilyardi et al. 2009).

Studies in the 1990s showed real-time ENSO prediction capability at a moderate level, with forecast versus observation correlations of about 0.6 for 6-month lead predictions (i.e., 6 months between the time of the forecast and the *beginning* of the predicted period) of 3-month mean conditions (Barnston et al. 1994). At that time, dynamical and statistical models showed comparable skills. The lack of conclusive ability for dynamical models to outperform statistical models was also found in predictions of the very strong El Niño of 1997/98 (Landsea and Knaff 2000; Barnston et al. 1999). Predictions at the 0.6 skill level are useful but leave much to be desired. The performance of statistical predictions was considered because they are simpler and less expensive to develop and serve as a baseline reference against which the skill of the more complex dynamical models can be compared.

Beginning early in 2002, predictions from a large number of models for the sea surface temperature (SST) in the Niño-3.4 region (5°N–5°S, 120°–170°W; Barnston et al. 1997) have been collected and displayed each month on a graph called the “ENSO prediction plume” (Fig. 1), on an International Research Institute for Climate and Society (IRI) web page (**http://iri.columbia.edu/climate/ENSO/currentinfo/SST_table.html**) and shown in the National Oceanic and Atmospheric Administration (NOAA) Climate Prediction Center's (CPC's) monthly ENSO discussion. This paper reviews the performance of the constituent models and attempts to discern changes (ideally, improvements) from the levels seen in the earlier studies. We reexamine the question of the relative performance of dynamical and statistical models, and also compare the skills of 9 yr of real-time predictions to those of longer-term (30-yr) hindcasts from some of the same models, and discuss possible explanations for their differences.

An overview of the data, methods, and ENSO prediction models is provided in section 2. Results are shown and examined in section 3, and a discussion and some conclusions are given in section 4.

## DATA, METHODS, AND MODELS.

### Data.

The ENSO predictions issued each month from February 2002 through January 2011 are examined here for multiple lead times for future 3-month target (i.e., predicted) periods. The last target period is January–March 2011, while the earliest target period is February–April 2002 for the shortest lead time and October–December 2002 for the longest lead time.

The forecast data from a given model consist of a succession of running 3-month mean SST anomalies with respect to the climatological means for the respective predicted periods, averaged over the Niño-3.4 region. Predicted periods begin with the 3-month period beginning immediately after the latest available observed data, and continue for increasing lead times until the longest lead time provided by the given model, to a maximum of nine running 3-month periods. Here, lead time is defined by the number of months of separation between the latest available observed data and the beginning of the 3-month forecast target period. (For example, using observed data through March, a prediction for the April–June season has a lead time of 0 months, for May–July a lead of 1 month, etc.) Typically new predictions become available one to two weeks following the last available month of observed data, so that the 0-month lead prediction for the April–June season becomes available during mid-April.
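The lead-time convention above can be made concrete with a short sketch. The function name `target_season` and the one-letter month labels are illustrative choices, not part of the original forecast archive.

```python
MONTH_INITIALS = "JFMAMJJASOND"


def target_season(last_obs_month: int, lead: int) -> str:
    """Return the 3-month target season label (e.g. 'AMJ').

    last_obs_month: calendar month (1-12) of the latest observed data.
    lead: lead time in months, i.e. the number of months separating the
          latest observed data from the start of the target period
          (0 means the target begins immediately after the observations).
    """
    if not 1 <= last_obs_month <= 12:
        raise ValueError("last_obs_month must be in 1..12")
    if lead < 0:
        raise ValueError("lead must be non-negative")
    # 0-based index of the first month of the target period.
    start = (last_obs_month + lead) % 12
    return "".join(MONTH_INITIALS[(start + k) % 12] for k in range(3))


# Observed data through March: lead 0 targets April-June, lead 1 May-July.
print(target_season(3, 0), target_season(3, 1))
```

With observed data through January 2002, the same function gives February–April for the shortest lead (0) and October–December for the longest lead (8), consistent with the target periods described in the Data section.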

Although anomalies were requested to be with respect to the 1971–2000 climatology, some prediction anomalies were with respect to means of other periods, such as from 1982 to the early 2000s for some dynamical predictions. No attempt was made to adjust for these discrepancies.^{1} Similarly, although bias correction was encouraged (some centers conducted statistical corrections on their model output), no biases were corrected by the IRI, and the forecasts were used as disseminated by the producers. All of the dynamical models produce an ensemble of predictions, representing a probability distribution of outcomes. Although these distributions can be verified probabilistically, here we only consider the ensemble mean as a deterministic prediction. This approach enables the dynamical models to be verified in the same way as the statistical models, most of which provide only a single prediction.^{2}

The Reynolds–Smith (Reynolds et al. 2002) version 2 optimal interpolation (OI) observed SST data averaged over the Niño-3.4 region is used to verify the model ENSO predictions. The OI SST has a base period of late 1981 to present, and the 1981–2010 period is used here to define the verification anomalies.

### Methods.

The verifications conducted here focus on the performance of individual models, rather than on multimodel mean predictions as examined in Tippett et al. (2012). Accordingly, detection of skill differences between dynamical and statistical models is carried out through aggregating the skills (as opposed to the predictions) of the models of each type. Verifications are applied to real-time predictions over the 9-yr period, as well as to longer-term hindcasts of some of the models, to compare the performance between the two settings and two base periods.

Verification measures used include the temporal correlation, root mean squared error, bias, and standard deviation ratio. Applied to each of the several lead times, the measures are used both for all predictions over the 9-yr period and for seasonally stratified predictions. An additional diagnostic is the lag correlation between forecasts and observations, to detect systematic tendencies for predictions intended for a given lead time to verify with higher skill at other lead times. This diagnostic will reveal a tendency of most models to be late in forecasting ENSO state transitions, such that predictions verify better on the observations at lead times earlier than those intended.
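As a minimal sketch of the four verification measures, the following function computes them for one model, one lead time, and one set of matched forecast and observed anomalies. The function name and the dictionary keys are hypothetical; population (divide-by-*n*) variances are assumed throughout.

```python
import math


def verification_measures(forecasts, observations):
    """Correlation, RMSE, mean bias, and standard deviation ratio.

    forecasts, observations: equal-length sequences of anomalies for
    matching target periods.
    """
    n = len(forecasts)
    if n < 2 or n != len(observations):
        raise ValueError("need two equal-length series with n >= 2")
    f_mean = sum(forecasts) / n
    o_mean = sum(observations) / n
    f_dev = [f - f_mean for f in forecasts]
    o_dev = [o - o_mean for o in observations]
    f_var = sum(d * d for d in f_dev) / n
    o_var = sum(d * d for d in o_dev) / n
    if f_var == 0.0 or o_var == 0.0:
        raise ValueError("series must have non-zero variance")
    cov = sum(fd * od for fd, od in zip(f_dev, o_dev)) / n
    mse = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    return {
        "correlation": cov / math.sqrt(f_var * o_var),
        "rmse": math.sqrt(mse),
        "bias": f_mean - o_mean,              # forecast mean minus observed mean
        "sd_ratio": math.sqrt(f_var / o_var),  # forecast SD over observed SD
    }
```

A forecast that doubles every observed anomaly and adds 0.5 would score a perfect correlation while showing a mean bias and a standard deviation ratio of 2, which illustrates why correlation alone does not capture calibration problems.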

### The ENSO prediction models.

The 20 models whose real-time predictions are evaluated here include 12 dynamical models and 8 statistical models, as shown in Table 1.^{3} The statistical models are developed using historical datasets and include various forms of regression (some based on autocorrelations or transition statistics), neural networks, or analogues. The dynamical models, based primarily on the physical equations of the ocean–atmosphere system, range from relatively simple and abbreviated physics to comprehensive fully coupled or anomaly coupled models. Some models were introduced during the course of the study period, or replaced a predecessor model. Many of the dynamical models have been upgraded throughout the study period, while the statistical models have remained more constant. The model names used here refer to the model versions in early 2011. A brief guide to each of the models, with key references, is provided in the appendix (available online at **http://dx.doi.org/10.1175/BAMS-D-11-00111.2**).

## RESULTS.

### ENSO variability during the 2002–11 period.

The 9-yr study period is too short for many findings to be statistically robust, but long enough for some results to be suggestive and warrant further exploration. Substantial sampling errors for a 9-yr period are expected for both model behavior and observed ENSO behavior. To assess qualitatively whether the ENSO variability during the 9-yr study period is approximately comparable to that of a multi-decadal period, features of the time series of observed seasonal Niño-3.4 SST anomalies during 2002–11 are compared with those of 1981–2011 (Fig. 2). During 2002–11, at least moderate strength El Niño events occurred in 2002/03 and 2009/10, and likewise for La Niña events in 2007/08 and 2010/11. Weaker El Niño events occurred in 2004/05 and 2006/07, and borderline La Niña events^{4} occurred in 2005/06 and 2008/09. The Northern Hemisphere autumn/winter of 2003/04 was the only peak ENSO season having neutral ENSO conditions.

Table 2 shows the seasonal march of observed interannual standard deviation of Niño-3.4 SST for the 1981–2011 period compared with the 2002–11 study period, and the mean anomaly of the study period. The two profiles are fairly similar, with the study period showing somewhat lower variability, particularly during the middle of the calendar year. This smaller variability is likely related to the lack of very strong events such as the El Niño events of 1982/83 and 1997/98. A lack of a substantial upward trend within the 1981–2011 period is noted in Fig. 2, so that the higher standard deviation of the longer period cannot be attributed to a trend. The mean anomaly of the study period compared with the 1981–2010 period is weakly negative (positive) in the first (second) half of the calendar year (Table 2). A chi-square test indicates that the difference in standard deviation between 2002–11 and 1981–2011 is not statistically significant, with the strongest two-sided *p* value at 0.15 for the April–June season. Figure 3 shows autocorrelation of Niño-3.4 SST as a function of lead time and target season for the 1981–2011 period compared with the 2002–11 study period. The 2002–11 period has higher autocorrelation than the 1981–2011 period at short and intermediate lag times for target periods near the end and very beginning of the calendar year, indicating a persistence of SST anomalies during the time of year of typical ENSO event maturity. This may be related to the high proportion of the years between 2002 and 2011 having nonneutral ENSO conditions. On the other hand, autocorrelations are more strongly negative during 2002–11 than 1981–2011 at long lead for periods traversing the April–June period, consistent with an enhanced biennial variability during the study period. 
In fact, alternations of phase between northern autumn/early winter of consecutive years occurred in five of the eight year-to-year transitions, while a continuation of the same sign of anomaly occurred in three transitions. Using the 1981–2011 autocorrelations as population values and applying a Fisher *Z* test of the difference between them and the corresponding 9-yr autocorrelations, none of the latter are outside of the 95% confidence interval about the longer period values. Hence, the visible differences in the profiles of the two periods may be attributed to the expected sampling variability of a 9-yr period.

Whether the autocorrelation structure of the 2002–11 test period renders it more or less predictable than a period with a structure like that of 1981–2011 is an open question. It will be shown below, however, that the lower variability of the 9-yr period reduces its predictability.

### Real-time predictive skills of individual models.

Time series of the running 3-month mean observed SST anomalies in the Niño-3.4 region and the corresponding predictions by 23 prediction models at 0-, 2-, 4- and 6-month lead times are shown in Fig. 4. Figure 4 shows that the models generally predicted the variations of ENSO with considerable skill at short lead times, and decreasing skill levels with increasing lead times. A feature seen in Fig. 4 is a tendency for most of the models to predict the continuation of SST anomalies beyond their observed periods, and to do so to a greater extent for longer lead times (shown by color shading that tilts to the right with increasing lead time). False alarms have also occurred (e.g., some models predicted La Niña for 2003/04, which did not happen) but these have been less common than prolonged anomaly persistence. Figure 4 indicates forecast differences among the models that become increasingly pronounced with increasing lead time. These differences raise the question of whether some models are systematically more skillful than others.

#### Correlation and root mean squared error.

Figure 5 shows the temporal correlation between model predictions and the corresponding observations as a function of target season and lead time, with a separate panel for each model. The correlation skill patterns of the models appear roughly comparable. All indicate a northern spring predictability barrier, with short lead prediction skills having a relative minimum for northern summer, extending to later seasons at longer lead times. Relative to the statistical models, Fig. 5 shows higher correlation skills by many of the dynamical models for seasons in the middle of the calendar year that generally have lowest skill. By contrast, for seasons having highest skills (e.g., northern winter target seasons at short to moderate lead times), skill differences among models and between model types appear small.

Figure 6 shows individual model correlation skills as a function of lead time for all seasons combined, while the top and bottom panels of Fig. 7 show skills for the pooled target seasons of NDJ,^{5} DJF, and JFM, and for MJJ, JJA, and JAS, respectively. Overall, model correlation skills at 6-month lead range anywhere from 0.1 to about 0.7 for all seasons combined, while predictions for the northern winter season range from 0.5 to 0.9, and for the northern summer season from below zero to 0.55. For lead times greater than 2 months, persistence forecasts have lower correlation than that of any of the models. However, although the northern winter season is considerably better predicted than summer, a clear improvement in skill of the models over that of persistence is not seen for winter until leads of 6 months or more, while for summer persistence is the worst prediction for leads of 1 month or more. The model skill levels for all seasons combined (Fig. 6) differ from one another noticeably at all lead times, and some models that fare well (or poorly) at short lead times change their relative standing at intermediate or long lead times (e.g., Scripps). Averaged over all seasons, skills are somewhat lower than the 0.6 level found at 6-month lead in earlier studies (e.g., Barnston et al. 1994). However, a small number of current models, some of which do not predict out to 6 months lead, have shorter-lead skill levels that would exceed a 0.6 correlation if their forecast range were extended, *and* if their skill followed a downward slope with increasing lead time averaging that shown by other models having longer maximum lead times.
Examples of models with such good or potentially good skill include those of the European Centre for Medium-Range Weather Forecasts (ECMWF), the National Aeronautics and Space Administration Global Modeling and Assimilation Office (NASA GMAO), and the Japan Meteorological Agency (JMA); the National Centers for Environmental Prediction Climate Forecast System (NCEP CFS; version 1) skill approximately equals 0.6. However, two caveats in the comparison of skills of today's models against models of 10–20 years ago are that 1) the ENSO variability during the 2002–11 period will be demonstrated to have been more difficult to predict than that over 1981–2011 in general and 2) the current set of predictions were made in real time, while those examined in previous studies were partly hindcasts. Both factors will be examined further below.

One reasonably might ask whether the skill differences at any lead time are sufficient, for a 9-yr period, to statistically distinguish among the performance levels of some of the models. The expected sampling error of a correlation coefficient is shown in Table 3 as 90% and 95% confidence intervals about several population correlations, for a sample size (*n*) of 9 and 30. For *n* = 9 and a population correlation of 0.8, for example, the 90% confidence interval ranges from 0.40 to 0.94—a wide interval. For skills derived from all seasons combined, the effective sample may be somewhat larger than 9 but will be far smaller than 108 because of the high month-to-month autocorrelation of the observed and modeled ENSO state (Fig. 3), given the typical lifetime of ENSO episodes of 7–11 months. The determination of statistically significant differences between skills of any pair of individual models is not a main goal of this study. However, the statistical significance of skill differences between dynamical and statistical model types *is* of interest, and is addressed below.
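The sampling uncertainty quoted above can be approximated with the Fisher *Z* transformation. The sketch below assumes that approach (the construction of Table 3 is not described in the text) and uses a hypothetical function name; under that assumption it gives an interval of roughly 0.40 to 0.94 for a population correlation of 0.8 with *n* = 9.

```python
import math
from statistics import NormalDist


def correlation_confidence_interval(r, n, level=0.90):
    """Approximate confidence interval for a correlation via Fisher Z.

    r: population (or sample) correlation, strictly between -1 and 1.
    n: sample size (must exceed 3).
    level: two-sided confidence level, e.g. 0.90 or 0.95.
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher Z approximation")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                  # Fisher Z transform of r
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of Z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


low, high = correlation_confidence_interval(0.8, 9, level=0.90)
print(f"{low:.2f} to {high:.2f}")
```

Calling the function with `n=30` narrows the interval considerably, which is the reason the longer hindcast period supports firmer conclusions than the 9-yr real-time record.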

The correlation between model predictions and observations reflects purely the discrimination ability of the models, since biases of various types do not affect this metric. However, such prediction biases (e.g., calibration problems involving the mean or the amplitude of the predictions) are also part of overall forecast quality, despite being correctable in many cases. To assess performance in terms of both calibration and discrimination, root mean square error (RMSE) is examined. Here the RMSE is standardized for each season individually, to scale RMSE so that climatology forecasts (zero anomaly) result in the same RMSE-based skill (of zero) for all seasons, and all seasons' RMSE contribute equally to a seasonally combined RMSE. Figure 8 shows RMSE as a function of target season and lead time, with a separate panel for each model, and Fig. 9 shows RMSE as a function of lead time for all seasons together. The ECMWF model has the lowest RMSE over its range of lead times. For lead times greater than 2 months, persistence forecasts have higher RMSE than that of any of the models. There is clearly some comparability between correlation skill (Fig. 5) and RMSE (Fig. 8), with models having highest correlation tending to have low RMSE. However, exceptions are discernible, due to the effects of mean biases and amplitude biases.
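One way to realize the seasonal standardization described above is to divide each season's RMSE by the RMSE that a zero-anomaly (climatology) forecast would obtain for that season. This is an assumption made for illustration, since the exact scaling is not spelled out in the text. The function names are hypothetical, and the root-mean-square combination across seasons is likewise an illustrative choice.

```python
import math


def standardized_rmse(forecasts, observations):
    """RMSE scaled by the RMSE of a zero-anomaly (climatology) forecast.

    A climatology forecast scores exactly 1.0 in every season, so a skill
    defined as 1 minus this value would be zero for climatology.
    """
    n = len(forecasts)
    if n == 0 or n != len(observations):
        raise ValueError("need two non-empty series of equal length")
    rmse = math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    )
    clim_rmse = math.sqrt(sum(o * o for o in observations) / n)
    if clim_rmse == 0.0:
        raise ValueError("observed anomalies are all zero")
    return rmse / clim_rmse


def seasonally_combined(per_season_values):
    """Combine per-season standardized RMSE values with equal weights."""
    if not per_season_values:
        raise ValueError("need at least one season")
    return math.sqrt(
        sum(v * v for v in per_season_values) / len(per_season_values)
    )
```

Because each season is scaled by its own observed variability, a season with small anomalies (such as those near the middle of the calendar year) carries the same weight in the combined score as a season with large anomalies.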

#### Mean bias and standard deviation ratio.

Toward examining specific calibration-related diagnostics individually, Fig. 10 shows model mean bias for each model, defined as the mean of the model prediction minus the mean of the observation, as a function of target season and lead time.

The seasonal patterns of mean bias (Fig. 10) indicate a common pattern of positive bias near the beginning of the calendar year at short lead times, migrating to later seasons with increasing lead time. Inspection of the individual time series indicates that this bias is related to the generally unpredicted early dissipation of the El Niño events of 2002/03 and 2006/07, underprediction of the northern winter peaks of the La Niña events of 2007/08 and 2008/09, and the failure to predict the late-emerging borderline La Niña events of 2005/06 and 2008/09 in northern autumn. The bias near the beginning of the calendar year can be attributed more generally to failure to predict exceptions to the typical tendency of persistence of the ENSO state between approximately October and February.

On the other hand, a tendency for negative bias is noted near the middle of the calendar year at short lead times, migrating to later seasons for longer leads. This bias can be traced to underprediction of the El Niños of 2002/03, 2006/07, and 2009/10 during their initial rapid growth phase during northern summer. Although underprediction of the La Niña events is similarly noted, there were slightly more El Niño than La Niña episodes during the period. The tendency for reversal from positive to negative model bias from the beginning to the middle of the calendar year at short lead times is consistent with the models being slow to develop any ENSO event during late northern spring and summer, and also late to end any event near the beginning of the calendar year, combined with the slight ENSO asymmetry favoring El Niño during the study period.

Figure 11 shows, for each model, the ratio of the interannual standard deviation of model predictions to that of the observations, as a function of target season and lead time. First, let us consider what the optimal values of this ratio are. Ideally, an ensemble mean, or for that matter a regression forecast, is a representation of the predictable signal. Ensemble averaging directly (and regression indirectly) removes unpredictable noise. Observations, on the other hand, contain both signal and noise. Therefore, the ratio of ensemble mean variance to observation variance is the ratio of signal variance to total variance (signal plus noise). In other words, the optimal value of the ratio of ensemble mean variance to observation variance is the fraction of explained variance, which is the square of the correlation coefficient. Ideally, then, the ratio of ensemble mean standard deviation to observation standard deviation should always be less than 1, and should be much less than 1 when skill is low. The signal versus noise basis for ensemble mean variance is discussed in Rowell (1998), and for regression-based statistical prediction variance in Hayes (1973). Thus, the standard deviation ratios shown in Fig. 11 should ideally look similar to the plots of correlation skill shown in Fig. 5. This is clearly not the case; the ratio is not bounded by 1 and does not decrease with lead time. In fact, the models tend to have standard deviation ratios that maximize near the time of year when skills are lowest. Although a 9-yr period is inadequate to establish robust estimates of correlation skills from which to derive the optimum prediction-to-observation standard deviation ratio, it is obvious that many of today's models have serious challenges reproducing realistic signal-to-noise ratios.
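The signal-versus-noise argument above can be written out in a few lines. The notation is introduced here for illustration only: the observation $o$ is the sum of a predictable signal $s$ and noise $\varepsilon$ that is uncorrelated with the signal, and an ideal ensemble mean (or regression) forecast $f$ equals the signal.

$$
\begin{aligned}
o &= s + \varepsilon, \qquad f = s, \qquad \operatorname{cov}(s,\varepsilon) = 0,\\
\frac{\sigma_f^2}{\sigma_o^2} &= \frac{\sigma_s^2}{\sigma_s^2 + \sigma_\varepsilon^2},\\
r &= \frac{\operatorname{cov}(f,o)}{\sigma_f\,\sigma_o}
   = \frac{\sigma_s^2}{\sigma_s\sqrt{\sigma_s^2 + \sigma_\varepsilon^2}}
   = \frac{\sigma_s}{\sqrt{\sigma_s^2 + \sigma_\varepsilon^2}},\\
\frac{\sigma_f^2}{\sigma_o^2} &= r^2, \qquad \frac{\sigma_f}{\sigma_o} = r \le 1.
\end{aligned}
$$

Under these idealized assumptions, a model with a correlation skill of 0.6 would have an optimal standard deviation ratio of 0.6, and a ratio above 1 indicates predicted amplitudes that exceed what the predictable signal can justify.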

Examples of the contributions of mean bias and amplitude bias to RMSE in individual models can be identified. The NASA GMAO coupled model is one of the higher scoring models for correlation skill at lead times of 3 or more months, and the highest at long lead (Figs. 6 and 7). However, its RMSE is less favorable at intermediate (3–5 months) lead times (Figs. 8 and 9) because the model's good discrimination is offset by a substantially inflated amplitude at short to intermediate lead times, particularly just before the middle of the calendar year (Fig. 11) when the observations have smallest interannual variability and predictive skill is lowest. A seasonal pattern of mean bias (Fig. 10), also present in NASA GMAO, is not severe relative to that of other models. The NCEP CFS (version 1) model has a high correlation skill, and it is relatively free of bias.^{6} While its RMSE is also generally favorable, it is somewhat degraded at long lead times because the standard deviation ratio is too high just before the middle of the calendar year at intermediate and long leads (Fig. 11). The JMA model is one of the best performers in correlation and RMSE, hindered to some extent by mean bias in the early part of the year and too high a standard deviation ratio for the northern spring seasons. The ECMWF model shows exemplary performance in terms of correlation, mean bias, and a standard deviation ratio that is too high only for the MAM and AMJ seasons. Although it does not predict to long lead times,^{7} its RMSE is the lowest among all models in the study, particularly near the middle of the calendar year (Fig. 8) when prediction is most difficult (Stockdale et al. 2011).

Statistical models are typically designed to minimize RMSE and have little, if any, mean or amplitude bias in the training sample (but may have biases in independent predictions, particularly when they are temporally distant from the training period and a trend exists). The CPC canonical correlation analysis (CCA) model has relatively weak correlation skill at short lead times, other than near the very beginning and end of the calendar year; this weakness, combined with a strong positive bias during the first half of the calendar year, produces a comparatively high RMSE at short lead times despite a mainly favorable standard deviation ratio. The CPC constructed analog (CA) model has lower RMSE than that of most models because of slightly above-average correlation skill (Figs. 5 and 6) and remarkably little mean bias (Fig. 10) or amplitude bias. The University of California at Los Angeles (UCLA) *Theoretical Climate Dynamics* (TCD) model has similar performance attributes, and has the highest seasonally combined correlation skill among the statistical models (Fig. 6), exceeded by only a few dynamical models.

The examples above are only a subset of what could be an exhaustive consideration of the performance attributes of the 20 ENSO prediction models over the study period.

#### Target period slippage.

The rightward tilting of the color shading in the forecasts shown in Fig. 4 indicates that predictions correspond best with observations occurring earlier than the intended target season, particularly at intermediate and longer lead times. To capture this feature more clearly, Fig. 12 shows the correlation skills of each model as a function of lead time over a range of lag times, for all target seasons combined. The correlations at zero lag time reflect the skill of the predictions for the intended target period, while those at negative lag times show skills for target periods earlier than intended. Some degree of such “slippage” is noted for most of the models; this slippage tends to increase with increasing lead time. Marked slippage is noted for some of the statistical models, such as CPC Markov, the Climate Diagnostics Center linear inverse model (CDC LIM), the University of British Columbia neural network model (UBC NNET), and The Florida State University regression model (FSU REGR). Two of the statistical models, UCLA TCD and CPC CA, lack substantial slippage. Dynamical models are not immune to slippage, as seen for example in the NCEP CFS (noted also in Wang et al. 2010), the Center for Ocean–Land–Atmosphere (COLA) anomaly model, and a few others. However, many of the models that exhibit relatively mild slippage are dynamical, such as ECMWF, NASA GMAO, the Met Office (UKMO), JMA, and the Lamont–Doherty intermediate coupled model (LDEO), although some of these models only forecast out to intermediate leads, precluding potentially larger slippage. Examination indicates that slippage has a common pattern of seasonal variation, being most pronounced for target periods in the middle of the calendar year and expanding to later seasons with increasing lead time (not shown).
Slippage often presents itself when an El Niño or La Niña begins growing but is underpredicted (or predicted to grow later than observed) until it becomes at least moderately strong in the initial observations near the end of northern summer. Similarly, some models are systematically late in ending an ENSO event, creating a slippage effect in the first or second quarter of the calendar year by prolonging the event. Target period slippage can be described as an exaggerated tendency toward persistence, and therefore toward insufficient and/or late forecast signal evolution. As a systematic error, slippage can benefit from statistical correction (see Tippett et al. 2012).
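The slippage diagnostic can be sketched as a lag correlation between a forecast series and the observations. In this illustration, both series are indexed by target period, a lag of zero pairs each forecast with its intended target, and a negative lag pairs it with an earlier observation. The function names are hypothetical and the example series is synthetic.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("series must have non-zero variance")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(forecasts, observations, max_lag=3):
    """Correlate forecasts[t] with observations[t + lag] for each lag.

    Returns a dict mapping lag (in target periods) to correlation.
    A peak at a negative lag means the forecasts match observations
    that occurred earlier than the intended target period.
    """
    n = len(forecasts)
    if n != len(observations) or n <= max_lag + 1:
        raise ValueError("series must match and be longer than max_lag + 1")
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        pairs = [
            (forecasts[t], observations[t + lag])
            for t in range(n)
            if 0 <= t + lag < n
        ]
        f_vals, o_vals = zip(*pairs)
        result[lag] = pearson(f_vals, o_vals)
    return result


# Synthetic example: forecasts that reproduce the observations two periods late.
obs = [math.sin(0.5 * t) for t in range(40)]
late_forecasts = [math.sin(0.5 * (t - 2)) for t in range(40)]
corr = lagged_correlations(late_forecasts, obs)
print(max(corr, key=corr.get))  # lag with the highest correlation
```

For the synthetic late forecasts, the correlation peaks at a lag of −2 rather than at zero, which is the signature of slippage described above.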

#### Comparison of skill among models, and between dynamical and statistical models.

Case-to-case discrimination (as indicated by correlation skill; Murphy 1988) is often considered the most important component of final skill, since many calibration (bias-related) problems are correctable, while discrimination reflects a more fundamental ability of the prediction model. Figure 13 shows the anomaly of the squared correlation between the model predictions and the observations with respect to the mean of the squared correlation over all 20 models^{8} as a function of lead time and target season. Highlighted are some of the typical patterns of model differences, and patterns common to statistical or dynamical models collectively. With several exceptions, positive squared correlation anomalies are more commonly seen in the dynamical than statistical models, particularly in the comprehensive coupled dynamical models (e.g., NASA GMAO, NCEP CFS, JMA, ECHAM/MOM, and particularly ECMWF). A typical specific weakness of many of the statistical models is their short lead predictions for the northern summer target seasons, extending to later seasons for longer lead predictions. These target periods involve the northern spring predictability barrier, a longstanding difficulty in ENSO prediction (e.g., Jin et al. 2008). Predictions whose lead times do not traverse the months of April to June, by contrast, appear more equally successful among all models. Such predictions, focusing largely on continuing the evolution of ENSO events already in progress, are enhanced most noticeably by good calibration, including an accounting of ENSO's observed seasonal phase locking (e.g., Zaliapin and Ghil 2010). Statistical models have historically performed well for ENSO predictions that escape the need to predict new ENSO evolution or phase transitions associated with the northern spring predictability barrier.

Establishing statistical significance of skill differences between dynamical and statistical models for specific times of the year is difficult for a 9-yr study period. However, while there is only a small time sample, the fairly large number of models can be used to help overcome the short period length if we accept this 9-yr study period as a fixed condition.

Models are ranked by correlation skill for each season and lead time separately, using the 9-yr sample. Systematic differences in the ranks of the dynamical and statistical models are identified using the Wilcoxon rank sum test. Additionally, the average correlation of the dynamical and statistical models is compared using a standard *t* test, applied to the Fisher *Z* equivalents of the correlations. The *p* values resulting from these two statistical approaches are shown in Table 4. Although the difference-in-means test yields slightly more strongly significant results than the rank sum test, the season/lead patterns of the two approaches are similar. Significant differences are found at short lead time for the target periods near May–July, the seasons just following (and most strongly affected by) the northern spring predictability barrier. This significance pattern migrates to later target periods with increasing lead time, following the target periods corresponding to the fixed forecast start times of April or May. For forecasts whose lead times do not traverse the northern spring barrier, statistical versus dynamical skill differences are not significant.
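A rank sum comparison of the two model types for a single season and lead time can be sketched as follows. This version uses the large-sample normal approximation without a tie correction to the variance, which is an assumption; the study does not state which variant was applied. The function name and the skill values in the example are invented for illustration.

```python
import math
from statistics import NormalDist


def rank_sum_test(group_a, group_b):
    """Wilcoxon rank sum test using the normal approximation.

    Returns (z, p) where z > 0 means group_a tends to rank higher and
    p is the two-sided p value. Tied values receive average ranks.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must be non-empty")
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_a += avg_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0
        )
        i = j
    expected = n_a * (n_a + n_b + 1) / 2.0
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (rank_sum_a - expected) / sd
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Invented correlation skills for one season/lead cell.
dynamical = [0.62, 0.55, 0.70, 0.48, 0.66]
statistical = [0.41, 0.35, 0.50, 0.30]
z, p = rank_sum_test(dynamical, statistical)
```

For small groups an exact rank sum distribution would be preferable to the normal approximation; library routines such as `scipy.stats.ranksums` implement the same approximation used here.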

Although significant differences are noted for specific seasons and leads, there is a multiplicity of candidate season/lead combinations, and 5% of the 108 candidates (i.e., 5 or 6 of them) are expected to be significant by chance. In the case of the Wilcoxon test, 20 entries are significant, and 10 are significant at the 1% level. For the difference-in-means test, 20 entries are significant and 15 are significant at the 1% level. To assess the field significance of the collective results (Livezey and Chen 1983), Monte Carlo simulations are conducted in which the model type is randomly shuffled 5,000 times, maintaining the actual number of dynamical and statistical models for the given lead time, and the set of local significances is regenerated. Using the sum of the *Z* or *t* values of all 108 cells as the test statistic, the percentage of the 5,000 randomized cases that exceeds the actual case is determined. The *Z* or *t* values are taken as positive when the correlation of the dynamical models exceeds that of the statistical models, and negative for the opposite case. The resulting field significances are 0.034 and 0.026 for the Wilcoxon rank test and *t* test, respectively, indicating significantly low probabilities that the set of local significances occurred accidentally. This finding suggests that the circumstance under which local significance is found, namely forecasts impacted by the northern spring predictability barrier being more successful in dynamical than statistical models, is meaningful and deserves fuller explanation.
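The shuffling procedure can be sketched as below. This is a simplified illustration under stated assumptions: the data are synthetic, every model is taken to be available in every cell (whereas the study maintained the actual model counts for each lead time), only the *t* statistic variant is shown, and all function and variable names are hypothetical.

```python
import math
import random

def t_stat(a, b):
    """Pooled-variance t statistic, positive when mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

def field_statistic(z_by_cell, is_dynamical):
    """Sum of the per-cell t values over all season/lead cells."""
    total = 0.0
    for cell in z_by_cell:
        dyn = [z for z, d in zip(cell, is_dynamical) if d]
        stat = [z for z, d in zip(cell, is_dynamical) if not d]
        total += t_stat(dyn, stat)
    return total

def field_significance(z_by_cell, is_dynamical, n_shuffles=5000, seed=0):
    """Fraction of label shuffles whose statistic reaches the actual one."""
    rng = random.Random(seed)
    actual = field_statistic(z_by_cell, is_dynamical)
    labels = list(is_dynamical)
    reached = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # keeps the number of each model type fixed
        if field_statistic(z_by_cell, labels) >= actual:
            reached += 1
    return actual, reached / n_shuffles

# Synthetic Fisher Z values: 12 cells, 12 dynamical and 8 statistical models.
data_rng = random.Random(1)
is_dynamical = [True] * 12 + [False] * 8
z_by_cell = []
for _ in range(12):
    cell = []
    for d in is_dynamical:
        r = (0.75 if d else 0.60) + data_rng.uniform(-0.1, 0.1)
        cell.append(math.atanh(r))
    z_by_cell.append(cell)

actual, p_field = field_significance(z_by_cell, is_dynamical, n_shuffles=1000)
print(f"sum of t = {actual:.1f}, field significance = {p_field:.3f}")
```

Because the same shuffled labels are applied to every cell, the test respects the dependence among cells that share models, which is the reason for judging the cells collectively rather than one at a time.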

A likely reason that dynamical models are better able to predict ENSO through the time of year when transitions (dissipation of old events and/or development of new events) typically occur is their more effective detection, through the initial conditions, of new evolution in the ocean–atmosphere system on a relatively short (i.e., intramonth) time scale; such evolution may go unnoticed by statistical models that use monthly or seasonal means for their predictor variables. Additionally, dynamical models are capable of nonlinear compounding effects of anomaly growth due to their time-marching design using small time steps, enabling fast evolution. While the details of such rapidly evolving scenarios indicated in a single model run may not have a high probability of actually occurring, consideration of predictions from an ensemble of many runs helps to define such probabilities. Statistical models might be able to compete better against dynamical models if they used finer temporal resolution, such as weekly means. Although use of coarser temporal resolution reduces noise and may serve a purpose similar to using large ensembles in dynamical models, there are circumstances under which rapid recent evolution on shorter time scales is crucial to successful prediction.

Statistical models need long histories of predictor data to develop their predictor–predictand relationships. This need presents a problem in using the three-dimensional observations in the tropical Pacific, such as the data from the Tropical Atmosphere Ocean–Triangle Trans-Ocean Buoy Network (TAO-TRITON) array (McPhaden et al. 1998), which dates from the 1990s. [However, some subsurface tropical Pacific data date back 10 or more years earlier in the eastern portion of the basin, and are available in the NCEP Global Ocean Data Assimilation System (GODAS) product.] This shorter data history precludes robust empirical definition of the associated predictive structures, and thus these observations are often omitted from statistical models. Although comprehensive dynamical models require a data history sufficient for verification and as a basis for defining anomalies, such a history is not basic to their functioning, and real-time predictions are able to take advantage of improved observing systems as they become available, potentially resulting in better initial conditions. While use of such crucial data suggests that dynamical models should be able to handily outperform statistical models, dynamical models have been burdened by problems such as initialization errors related to problems in data analysis/assimilation, and biases or drifts stemming from imperfect numerical representation of critical air–sea physics and parameterization of small-scale processes. As these weaknesses have been reduced, some comprehensive dynamical models have begun demonstrating their higher theoretical potential, and although they are much more costly than statistical models, their performance may continue to increase (Chen and Cane 2008).

### Real-time predictive skill versus longer-period hindcast skill.

Because 9 years is too short a period from which to determine predictive skill levels with precision (Table 3), one reasonably might ask to what extent the performance levels sampled here could be expected to hold for future predictions. To achieve more robust skill estimates, a commonly used strategy is to increase the sample of predictions by generating retrospective hindcasts—“predictions” for past decades using the same model and procedures as in real time, to the extent possible. Cross-validation schemes are often used with statistical models, where varying sets of one or more years are withheld from the full dataset, and the remaining years are used to define the prediction model, which then is used to forecast the withheld year(s) (Michaelsen 1987). However, this sequential withholding technique can result in a negative skill bias under low skill conditions (Barnston and Van den Dool 1993) and/or a positive skill bias when information “leaks” into the training samples because some predictors or parameters have been selected using the same (or a similar) full dataset. In practice, there is no comparable procedure applied in dynamical model development, and model parameter choices are often made using the same data used to evaluate skill.
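The withholding scheme can be sketched with leave-one-out cross validation. This is a toy example under stated assumptions: the one-predictor linear regression, the 10-yr data values, and all function names are hypothetical and do not correspond to any model in this study. The second function shows an extreme case of the negative skill bias: when each withheld year is forecast with the mean of the remaining years (i.e., no real skill), the cross-validated forecasts correlate at exactly −1 with the observations.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def fit_line(x, y):
    """Least squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def loo_regression(x, y):
    """Forecast each year from a regression trained on all other years."""
    preds = []
    for k in range(len(x)):
        slope, intercept = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        preds.append(slope * x[k] + intercept)
    return preds

def loo_climatology(y):
    """Forecast each withheld year with the mean of the remaining years."""
    total, n = sum(y), len(y)
    return [(total - yk) / (n - 1) for yk in y]

# Hypothetical predictor (x) and predictand (y) anomalies for 10 years.
x = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5]
y = [-0.9, -1.0, -0.2, 0.1, -0.3, 0.5, 0.2, 1.0, 0.8, 1.3]

cv_skill = pearson(loo_regression(x, y), y)
clim_skill = pearson(loo_climatology(y), y)
print(f"cross-validated regression skill: {cv_skill:.2f}")
print(f"cross-validated climatology skill: {clim_skill:.2f}")
```

The climatology result arises because each withheld value lowers the mean of the remaining years in proportion to its own size, so the forecast is a decreasing linear function of the verifying observation.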

Fourteen of the 20 models whose 9-yr real-time forecast performance was discussed above (6 dynamical, 8 statistical) have produced hindcasts available to this study for the approximately 30-yr period of 1981 (or 1982) to 2011. To assess the consistency of their skills during the longer period and the 9-yr period of real-time predictions, the temporal correlation between hindcasts and observations is examined as a function of target seasons and lead time. Figure 14 shows a comparison of the correlation skills for the 9-yr real-time predictions (as in Fig. 5) and the 30-yr hindcasts for the subset of models having both datasets. Although the correlation plots are roughly similar, inspection shows generally higher hindcast skill levels for all of the models. Why do the hindcasts have higher skills? One explanation is that the 2002–11 period may have been more difficult to predict than most of the longer period. Another explanation is that skills tend to be higher in hindcasts than in real-time predictions because the cross-validation design may still allow inclusion of some artificial skill.

To assess the relative difficulty of the recent 9-yr period, the time series of uncentered correlation skills^{9} of sliding 9-yr periods, each phased 1 yr apart, are examined for the 22 running periods within 1981–2010. The resulting time series of correlations are shown in Fig. 15 (top), for lead times of 3 months and 6 months. It is clear that for all models, and for both lead times, the 2002–10 period, as well as the early to middle 1990s, posed a greater predictive challenge than most of the last three decades. As noted earlier, distinguishing features of the 2002–11 period have been 1) a lower amplitude of variability (no very strong events; Table 2); and 2) greater consecutive year alternations between El Niño and La Niña (Fig. 3). The former feature may be expected to reduce the upper limit of correlation skill by reducing the signal part of the signal-to-noise ratio. If the noise component remains approximately constant, and signal strength is somewhat restricted as during 2002–10, then the correlation is reduced. The bottom inset of Fig. 15 (top) shows the 9-yr running standard deviation of the observed Niño-3.4 SST anomalies, with respect to the 1981–2010 mean. The correlation between the running standard deviation and the model average skill is about 0.8 for both 3- and 6-month lead predictions, confirming a strong relationship between signal strength and correlation skill.
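The sliding-window calculation can be sketched as follows, assuming one value per year for simplicity. The series are synthetic and the function names hypothetical. Following the footnote, anomalies are taken with respect to the full-period mean rather than each window's own mean; dividing by the full-period standard deviation is omitted because a constant scale factor cancels in the ratio.

```python
import math

def anomalies(series):
    """Departures from the mean of the full series."""
    m = sum(series) / len(series)
    return [v - m for v in series]

def uncentered_corr(f, o):
    """Correlation of anomalies without removing the window means."""
    num = sum(a * b for a, b in zip(f, o))
    den = math.sqrt(sum(a * a for a in f) * sum(b * b for b in o))
    return num / den

def sliding_uncentered_corr(forecast, observed, window=9):
    """Uncentered correlation for each window, stepped one year at a time."""
    fa, oa = anomalies(forecast), anomalies(observed)
    return [uncentered_corr(fa[s:s + window], oa[s:s + window])
            for s in range(len(fa) - window + 1)]

def running_std(observed, window=9):
    """Window standard deviation about the full-period mean."""
    oa = anomalies(observed)
    return [math.sqrt(sum(a * a for a in oa[s:s + window]) / window)
            for s in range(len(oa) - window + 1)]

# Synthetic 30-yr series standing in for hindcasts and observations.
observed = [math.sin(0.7 * i) for i in range(30)]
forecast = [0.8 * o + 0.3 * math.cos(1.3 * i) for i, o in enumerate(observed)]

skill = sliding_uncentered_corr(forecast, observed)
spread = running_std(observed)
print(len(skill), "windows; first skill value:", round(skill[0], 2))
```

A 30-yr record with a 9-yr window yields 22 values, matching the number of running periods examined in the text.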

The average of the anomaly of the 2002–11 correlation with respect to that over 1981–2010 is approximately −0.14 (0.61 vs 0.75) for 3-month lead forecasts and −0.23 (0.42 vs 0.65) for 6-month lead forecasts. The −0.23 “difficulty anomaly” for 6-month lead forecasts is of greater magnitude than the deficit in skill of the real-time predictions during 2002–11 compared with the approximately 0.6 skill level found in earlier studies, suggesting that today's models would slightly (and statistically insignificantly) outperform those of the 1990s if the decadal fluctuations of the nature of ENSO variability were taken into account.
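The comparison can be written out for the 6-month lead, taking the real-time 2002–11 skill reported in this article (0.42), the approximate skill of the earlier studies (0.6), and the difficulty anomaly (−0.23). The symbols are introduced here only for illustration, and the simple additive adjustment is an assumption of this sketch rather than a formal correction:

$$
\begin{aligned}
\text{skill deficit} &= r_{\text{earlier}} - r_{\text{2002–11}} \approx 0.60 - 0.42 = 0.18 < 0.23,\\
\text{difficulty-adjusted skill} &= r_{\text{2002–11}} + 0.23 \approx 0.42 + 0.23 = 0.65 > 0.60.
\end{aligned}
$$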

To examine the signal versus skill relationship with more temporal precision, a 3-yr time window is used in Fig. 15 (bottom), the bottom inset again indicating the running standard deviation. Within the 2002–10 period, the subperiod of 2003–07, in between the 2002/03 El Niño and the 2007/08 La Niña, is a focal point of low skill and low variability. The correlation between the 3-yr running standard deviation and model average skill is about 0.78 for both 3- and 6-month lead predictions, confirming a strong linkage between signal strength and correlation skill with higher temporal resolution.

A second cause of the recent real-time predictions having lower correlation skill than the 30-yr hindcasts is that using a period for which the verifying observations exist may permit inclusion of some artificial skill not available in real-time predictions. Attempts to design the predictions in a manner simulating the real-time condition (e.g., cross validation) reduce artificial skill, but subtle aspects involving predictor selection within a finite group of commonly used datasets often prevent its total elimination. A completely retroactive design, in which training only on years earlier than the year being forecast is permitted, may be a purer simulation of the real-time situation. Further impediments to the skill of real-time predictions include such unavoidable inconveniences as delays in availability of predictor or initialization data, computer failure or other unforeseen emergencies, and human error. While these factors may seem minor, experience with the ENSO prediction plume has shown that such events are not rare.

Target period slippage was found in the real-time predictions, and one reasonably might expect it to appear in the longer-term hindcasts also. Figure 16 compares slippage in the selected models' hindcasts and corresponding real-time predictions and reveals a milder degree of slippage in the hindcasts. Could this difference be related to the higher average skill over the 30-yr period than its most recent 9 years? Verifications on sliding 9-yr periods show that slippage is greater during periods of lower average correlation skill (not shown). This finding suggests that slippage is most prominent when total error is greatest, which is the case for longer lead times and for seasons most impacted by the northern spring predictability barrier. When skill is relatively low, the models tend to err in the direction of missing the onset of new events (or being late in predicting the end of events), as opposed to predicting new events that turn out to be false alarms or ending events too early. In other words, the models are too persistent. During very strong events this bias is offset by high skill during the period when the event remains at high amplitude, since quasi-persistence is then an excellent forecast out to a longer lead time than is typically the case.
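One way to diagnose slippage is to verify the forecasts against observations shifted in time and find the shift with the highest correlation. The sketch below is illustrative only and is not the procedure behind Fig. 16: the monthly series are synthetic, the forecasts are constructed to resemble the observations two months before the target, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def lagged_skill(forecast, observed, lag):
    """Correlate forecasts for target t with observations at t + lag."""
    pairs = [(forecast[t], observed[t + lag])
             for t in range(len(forecast))
             if 0 <= t + lag < len(observed)]
    f, o = zip(*pairs)
    return pearson(f, o)

def best_lag(forecast, observed, max_lag=4):
    """Lag (in months) at which the forecasts verify best."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: lagged_skill(forecast, observed, k))

# Synthetic monthly anomalies with two oscillation periods.
observed = [math.sin(2 * math.pi * t / 40) + 0.3 * math.sin(2 * math.pi * t / 13)
            for t in range(120)]
# Sluggish forecasts: the value issued for month t mirrors month t - 2.
forecast = [0.0, 0.0] + [0.8 * v for v in observed[:-2]]

lag = best_lag(forecast, observed)
print("forecasts verify best against observations at lag", lag)
```

A negative best lag means the forecasts match observations that occurred before the intended target period, which is the signature of slippage.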

## DISCUSSION AND CONCLUSIONS.

Verification of the real-time ENSO prediction skills of 20 models (12 dynamical, 8 statistical) during 2002–11 indicates skills somewhat lower than those found for the less advanced models of the 1980s and 1990s. However, this apparent retrogression in skill is explained by the fact that the 2002–11 study period was demonstrably more challenging for ENSO prediction than most of the 1981–2010 period, due to a somewhat lower variability. A similar situation is found during the 9-yr period centered on the early 1990s. Thirty-year hindcasts for the 1981–2010 period yielded average correlation skills of 0.65 at 6-month lead time, but the real-time predictions for 2002–11 produced only 0.42. The fact that the recent predictions were made in real time, in contrast to the partially hindcast design in the earlier studies, introduces another difference with consequences difficult to quantify but more likely to decrease than increase the recent performance measures.

Based solely on the variability of 9-yr correlation skills of the hindcasts within the 30-yr period, ENSO prediction skill is slightly higher using today's models than those of the 1990s (0.65 vs about 0.6 correlation). *Decadal variability of ENSO predictability can strongly dominate the gradual skill improvements related to real advances in ENSO prediction science and models.*

Both real-time predictions and hindcasts have a tendency to verify with higher skill against observations occurring earlier than the intended forecast target period—a tendency that increases with increasing lead time. This “slippage” is related to systematically sluggish transitions: initiating new ENSO events too late and too weakly, and failing to end events on time. Slippage is more pronounced during multiyear periods of relatively weaker variability and skill, and for predictions starting from the time of year when ENSO transitions are most likely (March through May).

Unlike earlier results, the sample mean of skill of the dynamical models exceeds that of statistical models for start times between March and May when prediction has proven most challenging. Utilizing the fairly large numbers of dynamical and statistical models, the skill difference between the two model types is statistically significant under these seasonal conditions based on just 9 years of data, accepting this 2002–11 period as an unbiased and fixed condition of the analysis. The skill comparison by model type also passes a field significance test for all seasons and leads collectively, at the *p* = 0.03 level. The slightly greater dynamical skill is largely attributable to the most advanced and costly fully coupled ocean–atmosphere prediction systems having highest spatial resolution, using today's most advanced data assimilation systems for initialization (e.g., Balmaseda and Anderson 2009), and having the largest set of ensemble predictions. This finding suggests that continued implementation of better assimilation schemes for more realistic initial conditions, more detailed physics, higher resolution, and larger ensemble sizes makes additional advances in skill likely.

Why are the most comprehensive dynamical models found better able to predict ENSO than simpler dynamical models or statistical models through the time of year when ENSO transitions typically occur (previous episodes decay, or new episodes emerge)? A likely reason is their more effective detection and usage, through their initial conditions, of new information in the ocean–atmosphere system on a short (i.e., intramonth) time scale; such information may not play a major role in statistical models that use longer time means for their predictor variables. Statistical models might have potential for higher skill if their predictors were designed with finer temporal resolution. Because funding policy over the last two decades has favored dynamical over statistical prediction research proposals, the relatively greater advances in dynamical model skill are not surprising. Thus, while the major global forecast producing centers have come out with improved versions of their dynamical models every several years, most of the statistical models in this study have remained fundamentally unchanged over the last decade, and several even since the 1980s or 1990s [e.g., CPC Markov, CDC LIM, CPC CA, CPC CCA, the Colorado State University climatology–persistence model (CSU CLIPR), and UBC NNET].

However, aside from the funding preference factor, dynamical models are capable of nonlinear compounding effects of anomaly growth due to their time-marching design using small time steps, enabling faster ENSO state evolution than statistical models. Statistical models need long histories of predictor data to develop their predictor–predictand relationships, but the three-dimensional observations across much of the tropical Pacific (e.g., from the TAO-TRITON array) began only in the 1990s, precluding a robust empirical definition of their predictive relationships. Thus, these subsurface tropical Pacific predictors may be omitted in some statistical models. Comprehensive dynamical models require a data history for verification and for defining anomalies, but real-time predictions are possible without such a history, as data from current observing systems are available for their initial conditions.

On the other hand, dynamical models still have major specific problems such as initialization errors due to the details of data analysis/assimilation, and biases or drifts stemming from imperfect numerical representation of critical air–sea physics and parameterization of small-scale processes. As these weaknesses have been reduced, some of the comprehensive dynamical models have begun demonstrating their higher potential and may eventually prove to be the standard in ENSO prediction as they did decades ago in numerical weather forecasting. However, because of the profound fundamental differences between weather prediction and seasonal climate prediction, one cannot assume parallel evolutions of methodologies. In particular, seasonal climate is a large aggregation of running weather activity that collectively behaves in a considerably more linear fashion than its constituent weather activity, making it generally more amenable to statistical modeling than daily weather. For this reason, dynamical model skill has been slow to exceed statistical model skill in seasonal climate and/or ENSO prediction, and there is still an important role for statistical/empirical climate prediction when dynamical approaches fail to deliver useful predictions.

In conclusion, during the recent decade dynamical ENSO prediction models outperformed their statistical counterparts to a slight but statistically significant extent, primarily because of their better forecasts when traversing the northern spring predictability barrier. While doubt may be cast regarding whether the much greater cost of dynamical prediction is worth the benefit in performance, this cost is expected to decrease with time as science and engineering continue to advance.

## ACKNOWLEDGMENTS.

The authors appreciate the thoughtful comments and suggestions of the anonymous reviewers. Klaus Wolter provided a set of penetrating and useful comments and requests, leading to considerable improvements in the final paper. This work was funded by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (NA05OAR4311004). The views expressed are those of the authors and do not necessarily reflect the views of NOAA or its subagencies.

## Footnotes

A supplement to this article is available online (10.1175/BAMS-D-11-00111.2)

^{1} A warming trend in the Niño-3.4 region has been negligible within the 1981–2011 period, using the OI SST data (Reynolds et al. 2002), although the 1970s were about 0.4° cooler than the 1981–2011 average.

^{2} The CPC CA is an example of a statistical model that produces an ensemble of predictions.

^{3} Three dynamical models currently included on the plume that do not have sufficient time history on the plume to be evaluated in most of the analyses here are the Japan Frontier coupled model (Luo et al. 2005), the Météo-France coupled model (Déqué et al. 1994; Gibelin and Déqué 2003), and the COLA CCSM3 coupled model (Kirtman and Min 2009).

^{4} These two events fell slightly short of the criteria used by the Climate Prediction Center to qualify as a nonneutral ENSO episode (Kousky and Higgins 2007), but 2008/09 qualifies using the 1981–2010 climatology for the OI SST data used here.

^{5} Seasons are named using the first letters of the three constituent months (e.g., DJF refers to December–February).

^{6} Since 2009, NCEP has applied a statistical correction to its Niño-3.4 SST predictions.

^{7} The ECMWF predictions examined here were actually initialized one month earlier than were many of the other models, because the publicly available predictions are not updated until the middle of the month in order that the latest version may be sold commercially. This implies that the forecasts are of even greater quality than shown here.

^{8} After squaring negative correlations, the negative sign is reinserted.

^{9} For the uncentered correlation, the 9-yr means are not removed, so that standardized anomalies with respect to the 30-yr means, rather than the 9-yr means, are used in the cross products and the standard deviation terms.