## 1. Introduction

The El Niño–Southern Oscillation (ENSO) phenomenon has global impacts on climate and society (Ropelewski and Halpert 1996; Glantz 2001; Mason and Goddard 2001), and the ability to predict it has provided a conceptual and practical basis for seasonal climate forecasting. Despite improvements in prediction models and observing systems, ENSO forecasts remain uncertain. Forecast uncertainty is due to model and observation error, as well as the intrinsic chaotic nature of the ocean–atmosphere system. Therefore, forecast information is necessarily probabilistic and the most complete description of a forecast is its probability distribution. Ideally, an ENSO forecast distribution is the probability of the future ENSO state given the current observed state; that is, a conditional probability (DelSole and Tippett 2007). In practice, there are differences between such an ideal forecast distribution and the forecast distribution estimated from a numerical prediction model.

Past model performance data can sometimes be used to improve model output. For instance, model output statistics (MOS; in its simplest form, a regression between forecast and verifying observations) has long been used to correct biases and systematic errors in deterministic forecasts (Glahn and Lowry 1972). The same types of error affect ensemble forecasts and are manifest as errors in the central tendency of the forecast distribution. Additionally, there can be errors in the distribution of the forecast about the central tendency. For instance, not accounting for all sources of uncertainty results in an ensemble that has too little spread. Recently “ensemble MOS” methods have been developed to correct the entire forecast distribution (Wilks 2006; Wilks and Hamill 2007).

Another important development in forecasting is the use of multiple prediction models (Krishnamurti et al. 1999). Dynamical monthly climate predictions using state-of-the-art coupled ocean–atmosphere general circulation models (CGCMs) are currently produced at a number of meteorological global prediction centers worldwide. Many of the coupled models differ in their representation of physical processes, in their numerical schemes, or in their use of observations to construct initial conditions. Therefore, different models have differing biases in their reproduction of interannual variability, as well as in their forecasts of oceanic and atmospheric climate (Shukla et al. 2000). Numerous studies have demonstrated that multimodel forecasts are generally more skillful than single-model forecasts (Krishnamurti et al. 1999; Kharin and Zwiers 2002; Palmer et al. 2004; van Oldenborgh et al. 2005; Kug et al. 2007). Conceptually, three reasons for the enhanced skill of multimodel forecasts are: (i) differing biases may cancel, (ii) the ensemble size is increased so that sampling error is reduced, and (iii) while the multimodel ensemble may have less skill than the “best” model for a particular forecast, it may not be possible to identify the best model a priori (Hagedorn et al. 2005). The simplest multimodel forecast procedure pools models and ensembles and estimates the forecast distribution from the resulting multimodel ensemble. More sophisticated methods weight models according to their performance. Using multiple linear regression, Krishnamurti et al. (1999) combined seasonal atmospheric climate (and also weather) forecasts from different models into a “superensemble” forecast.

This paper describes the use of ensemble MOS and multimodel combination methods for producing probabilistic ENSO forecasts. We focus on SST anomaly predictions for the Niño-3.4 region in the tropical Pacific in view of its demonstrated representativeness of the ENSO phenomenon (Barnston et al. 1997). Our goal is to determine, and explain to the extent possible, the effects of different multimodel prediction weighting techniques (including equal weighting) on the skill of probabilistic monthly ENSO forecasts. The forecasts take the form of probabilities of whether the monthly Niño-3.4 index will lie in the upper, lower, or middle two quartiles, corresponding to warm, cold, or neutral conditions, respectively. Ensembles of retrospective coupled ocean–atmosphere hindcasts using seven CGCMs are used. Descriptions of the retrospective hindcast data are given in section 2. The two probabilistic hindcast verification measures, one global and the other local in probability space, are described in section 3. The different ensemble MOS and multimodel techniques used to estimate the forecast probabilities are discussed in section 4. Skill results are presented in section 5, and a summary and some conclusions are given in section 6.

## 2. Data

Forecasts are taken from the Development of a European Multimodel Ensemble System for Seasonal-to-Interannual Prediction (DEMETER) project (Palmer et al. 2004). The DEMETER project consists of global coupled model seasonal hindcasts from seven coupled models developed by the European Centre for Research and Advanced Training in Scientific Computation (CERFACS), Istituto Nazionale di Geofisica e Vulcanologia (INGV), the European Centre for Medium-Range Weather Forecasts (ECMWF), Laboratoire d’Océanographie Dynamique et de Climatologie (LODYC), Max Planck Institute (MPI), Météo-France, and the Met Office (UKMO). The DEMETER forecasts were started on 1 February, 1 May, 1 August, and 1 November and extend to 6 months after their start. Our analyses are based on the common period of 1980–2001. We consider monthly averages and refer to the monthly average that contains the start date as the zero lead forecast. Thus, the longest lead forecast is the 5-month lead forecast. We consider forecasts with leads 1–5. All models are represented by 9-member ensembles for a total of 63 forecasts for each of the 4 start times per year for 5 lead times.

We use as the observations the “extended” Niño-3.4 index computed from Kaplan et al. (1998) until October 1981 and from the National Centers for Environmental Prediction (NCEP) optimal interpolation, version 2, (OIv2; Reynolds et al. 2002) projected onto the EOFs of Kaplan from November 1981 onward. Warm, cold, and neutral ENSO months are defined as months in which the Niño-3.4 index falls in the upper quartile, lower quartile, or middle two quartiles, respectively, of our 22-yr base period. Quartile boundaries are calculated on a monthly basis so that the ENSO definitions vary through the calendar year as shown in Fig. 1a. Thus, larger anomalies are required to satisfy the ENSO definition during the later parts of the calendar year when the year-to-year variability is larger. This definition of ENSO events differs from ones such as that of the National Oceanic and Atmospheric Administration (NOAA), in which there is a constant anomaly threshold throughout the seasonal cycle (Kousky and Higgins 2007), or those of both NOAA and the International Research Institute for Climate and Society (IRI) where sustained SST anomalies over some multimonth period are required. Figure 1b shows the time series of standardized Niño-3.4 anomalies and the ENSO category.
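The month-by-month quartile categorization described above can be sketched as follows. This is an illustrative Python fragment; the function name and array layout are ours, not from the study:

```python
import numpy as np

def monthly_categories(nino34, n_years):
    """Assign each month's Nino-3.4 value to cold (0), neutral (1), or
    warm (2) using quartile boundaries computed separately for each
    calendar month over the base period."""
    x = np.asarray(nino34, dtype=float).reshape(n_years, 12)
    lo = np.percentile(x, 25, axis=0)   # per-calendar-month lower quartile
    hi = np.percentile(x, 75, axis=0)   # per-calendar-month upper quartile
    # 0 if at/below the lower quartile, 2 if above the upper, 1 otherwise
    return (x > lo).astype(int) + (x > hi).astype(int)
```

Because the quartiles are computed per calendar month, a given anomaly can fall in different categories at different times of year, consistent with the larger year-to-year variability late in the calendar year.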

## 3. Skill measures

### a. Ranked probability skill score

The ranked probability score (RPS) of a categorical probability forecast **P** = [*P*(cold), *P*(neutral), *P*(warm)] is

$$\mathrm{RPS} = \sum_{k=1}^{3} \left( \sum_{i=1}^{k} P_i - \sum_{i=1}^{k} O_i \right)^{2},$$

where the categories are indexed in the order cold, neutral, warm; *O*(cold) is 1 when the observation is in the lowest quartile category and 0 otherwise; likewise, *O*(warm) and *O*(neutral) are 1 when the observation is in the corresponding category and 0 otherwise. RPS is oriented such that low scores indicate high forecast quality. The ranked probability skill score (RPSS; Epstein 1969) is

$$\mathrm{RPSS} = 1 - \frac{\mathrm{RPS}}{\mathrm{RPS}_{\mathrm{ref}}},$$

where $\mathrm{RPS}_{\mathrm{ref}}$ is the RPS of a simple reference forecast such as the equal-odds climatological forecast **C** = (0.25, 0.50, 0.25) or a persistence-based forecast. Positive RPSS indicates a forecast with greater skill than the reference forecast. In evaluating forecast quality, we average the RPSS of individual predictions using the climatological frequency forecast **C** as the reference forecast.
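The RPS and RPSS computations can be expressed compactly in terms of cumulative probabilities over the ordered categories. The following Python sketch is illustrative (function names are ours):

```python
import numpy as np

def rps(forecast, observed_category):
    """Ranked probability score: sum of squared differences between
    cumulative forecast and cumulative observed probabilities."""
    obs = np.zeros(len(forecast))
    obs[observed_category] = 1.0
    return float(np.sum((np.cumsum(forecast) - np.cumsum(obs)) ** 2))

def rpss(forecast, observed_category, reference=(0.25, 0.5, 0.25)):
    """Skill relative to a reference forecast; the default reference is
    the equal-odds climatological forecast C = (0.25, 0.50, 0.25)."""
    return 1.0 - rps(forecast, observed_category) / rps(reference, observed_category)
```

A perfect deterministic forecast gives RPSS = 1, while a forecast identical to the reference gives RPSS = 0.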

Another simple benchmark forecast that we will evaluate is the conditional climatology, that is, the historical frequency-based probabilities of each category given the category of the preceding month. This kind of reference forecast was used in Mason and Mimmack (2002), in which it was called the “damped persistence strategy” in order to distinguish it from an outright forecast of persistence (with a probability of 100%) of the ENSO state existing at the time of the forecast. For instance, for a February forecast the conditional climatology would be the frequency of each ENSO category conditioned on the state in January. The historical frequencies are computed using the independent period 1900–79.
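A minimal sketch of such a conditional-climatology (damped persistence) forecast follows; for simplicity it estimates a single transition table from a category series, whereas the study conditions on the calendar month and uses the independent 1900–79 period:

```python
import numpy as np

def conditional_climatology(categories, n_cat=3):
    """Transition-frequency table: row i gives the historical
    probabilities of each category this month given category i last
    month, estimated by counting consecutive-month transitions."""
    counts = np.zeros((n_cat, n_cat))
    for prev, curr in zip(categories[:-1], categories[1:]):
        counts[prev, curr] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.where(rows == 0, 1.0, rows)
```

For a February forecast, the benchmark probabilities would then be the table row selected by January's observed category.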

### b. Ignorance and rate of return

The ignorance Ig (measured in bits) of an individual forecast is

$$\mathrm{Ig} = -\log_2 P(\text{observed category}),$$

where $\log_2$ is the base-two logarithm and *P*(observed category) is the forecast probability assigned to the observed category. Unlike RPSS, ignorance depends only on the forecast probability of the observed category. This skill measure grades incorrect deterministic forecasts (i.e., forecasts of zero probability of events that occur) infinitely harshly. The reduction of ignorance relative to a climatological forecast is $\mathrm{Ig} - \mathrm{Ig}_{\mathrm{climo}}$, where $\mathrm{Ig}_{\mathrm{climo}}$ is the ignorance of the climatological forecast **C**. The rate of return (ROR) attaches an investment value to the forecast information and ranks forecasts in the same order as ignorance (Roulston and Smith 2002).
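In code, the ignorance and its reduction relative to climatology might be computed as follows (illustrative Python, names ours):

```python
import math

def ignorance(forecast_probs, observed_category):
    """Ignorance in bits: -log2 of the probability assigned to the
    observed category (infinite if that probability is zero)."""
    p = forecast_probs[observed_category]
    return float('inf') if p == 0 else -math.log2(p)

def ignorance_reduction(forecast_probs, observed_category, climo=(0.25, 0.5, 0.25)):
    """Ig - Ig_climo: negative values mean the forecast beat climatology."""
    return ignorance(forecast_probs, observed_category) - ignorance(climo, observed_category)
```

Note the infinite penalty: any method that can emit a zero probability for an event that subsequently occurs receives an unbounded score, which is why overconfident schemes fare poorly under this measure.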

## 4. Estimation of probabilities from ensembles

The methods used here to estimate ENSO probabilities from multimodel forecast ensembles can be divided into three classes. In the first class are methods that only use forecast model output—observations are not used. We call methods in this class *uncalibrated methods.* Uncalibrated methods assume that the ensemble mean and ensemble statistics are correct. However, past model performance may indicate deficiencies in these quantities as well as strategies for improving forecast probabilities. Methods that use observations and take into account model performance are classified according to whether models are compared with each other. Methods that use equal weighting or individual model performance, independent of the behavior of other models, are called *independent calibration methods,* whereas methods in which the contribution of a model to the forecast probabilities depends on the behavior of the other models are called *joint calibration methods.* We now describe the methods in these three classes. A summary list of the methods is given in Table 1.

### a. Uncalibrated methods

A simple nonparametric method of estimating ENSO probabilities from a forecast ensemble is to count the number of ensemble members in the cold, neutral, and warm categories. Systematic model errors are corrected by using the forecast model’s climatology to compute quartile boundaries. Quartile boundaries are computed in a cross-validated fashion so that the current forecast is not included but all the other ones are. The category definition varies with start date and lead time. Furthermore, because of the cross-validation design, the category definition varies slightly even within one start date and lead time, depending on which year is being held out as the target of the forecast.
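An illustrative sketch of the counting estimate with cross-validated quartile boundaries (Python; function names and array conventions are ours):

```python
import numpy as np

def count_probabilities(ensemble, lower_q, upper_q):
    """Fraction of ensemble members in the cold (below the lower
    quartile), neutral, and warm (above the upper quartile) categories."""
    ens = np.asarray(ensemble, dtype=float)
    cold = float(np.mean(ens < lower_q))
    warm = float(np.mean(ens > upper_q))
    return np.array([cold, 1.0 - cold - warm, warm])

def crossval_quartiles(forecasts, held_out_year):
    """Quartile boundaries from the model's own forecast history for a
    given start date and lead, excluding the target year (cross validation)."""
    others = np.delete(np.asarray(forecasts, dtype=float), held_out_year, axis=0).ravel()
    return np.percentile(others, 25), np.percentile(others, 75)
```

Excluding the held-out year is what makes the category definition vary slightly from one target year to the next within a single start date and lead.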

Errors in the climatological mean and variance of a single forecast model are corrected by using quartile definitions based on the model’s own forecast history. However, when a multimodel ensemble containing many models is formed, this procedure corrects only the systematic errors of the entire ensemble. We refer to this minimally adjusted ensemble as simply the multimodel ensemble (MM). A more refined correction scheme is to form the multimodel ensemble with the anomalies of each model with respect to its own climatology. We refer to this as the multimodel ensemble with bias correction (MM-bc). A further refinement is accomplished by forming the multimodel ensemble with the normalized anomalies of each model, thus removing any systematic differences between the ensemble variances of individual models. We refer to this option as the multimodel with variance and bias correction (MM-vc). In all three correction schemes, the bias and/or variance used to correct a particular forecast are computed without using that forecast (i.e., cross validation is always used).
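The MM-bc and MM-vc corrections can be sketched as follows (illustrative Python; the actual procedure applies the corrections per start date and lead, under cross validation):

```python
import numpy as np

def pool_bias_corrected(model_hindcasts, normalize=False):
    """Pool anomalies of each model about its own hindcast climatology
    (MM-bc); optionally standardize each model's anomalies first so all
    models contribute equal variance (MM-vc).
    model_hindcasts: list of (years, members) arrays, one per model."""
    pooled = []
    for h in model_hindcasts:
        h = np.asarray(h, dtype=float)
        anom = h - h.mean()           # remove the model's own climatology
        if normalize:
            anom = anom / anom.std()  # equalize ensemble variances (MM-vc)
        pooled.append(anom)
    return np.concatenate(pooled, axis=1)  # multimodel anomaly ensemble
```

The plain MM ensemble would instead pool the raw values and correct only the systematic errors of the combined ensemble.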

We use the following two schemes to investigate the extent to which forecast probabilities can be parameterized by the multimodel ensemble mean alone. These methods address the question of the relative importance of the forecast mean and of the distribution about that mean for predictability. Kleeman (2002) examined this issue in several simple models including a stochastically forced coupled ocean–atmosphere model used to predict ENSO, finding that changes in the ensemble mean provided most of the prediction utility. Similar results have been seen in other climate problems (Kleeman 2002; Tippett et al. 2004, 2007; Tang et al. 2007). In the context of extended-range weather forecasts, Hamill et al. (2004) found ensemble spread not to be a useful predictor in constructing probability forecasts.

In the first such scheme (MM-c), the forecast distribution is taken to be a constant distribution about a varying mean. The forecast distribution is formed from the ensemble spreads from all years centered about the MM-bc ensemble mean. This constructed ensemble has more members than the actual ensemble by a factor of 22 and so sampling error is reduced. The constructed ensemble has a distribution about the ensemble mean that varies with start month and lead but not with forecast year.
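A sketch of the MM-c construction (illustrative Python, names ours):

```python
import numpy as np

def constructed_ensemble(bias_corrected, year):
    """MM-c: pool the deviations from each year's ensemble mean over all
    hindcast years, then center that fixed spread on the target year's
    (MM-bc) ensemble mean."""
    bc = np.asarray(bias_corrected, dtype=float)       # (years, members)
    deviations = bc - bc.mean(axis=1, keepdims=True)   # spread, every year
    return bc[year].mean() + deviations.ravel()        # years-times-larger ensemble
```

Only the center of the distribution varies from forecast to forecast; the shape and spread are fixed for a given start month and lead.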

Another way of using the ensemble mean to parameterize probabilities is to form a generalized linear regression (GLR) between the ensemble mean and the uncalibrated model output probabilities, in particular those from the MM-bc method. We call this method MM-glr. This regression constructs a parametric connection between the ensemble mean and model forecast ENSO probabilities (i.e., no observations are used). When the probit model is used in the GLR, as is the case here, the procedure is related to fitting a Gaussian when the distribution is indeed Gaussian but sometimes performs better than Gaussian fitting for data that does not have a Gaussian distribution (Tippett et al. 2007). This method should not be confused with the commonly used method of developing a GLR between the ensemble mean and *observations*, which serves to calibrate the model with observations (Hamill et al. 2004). Rather the GLR used in the MM-glr method parameterizes the ensemble probabilities in terms of the ensemble mean but does not correct the model. Such a regression can serve to reduce the sampling variance of the counting probability estimate because of finite ensemble size (Tippett et al. 2007).

### b. Independent calibration

The simplest independent calibration method used here assumes that the bias-corrected multimodel ensemble mean is the best estimate of the forecast mean and then estimates the forecast distribution about it from past performance rather than from the ensemble distribution. Using a Gaussian distribution to model forecast uncertainty gives the method that we call MM-g. We fit the Gaussian distribution to past performance in a way that accounts for any systematic amplitude errors and bias (see appendix A). In the MM-g method, the forecast distribution is Gaussian with variance that is constant from one forecast to another, with its size determined by the correlation between forecasts and observations (see appendix A). Nonhomogeneous Gaussian regression (ngr) offers a more general framework with the forecast variance changing from one forecast to another (Gneiting et al. 2005; Wilks 2006). In the ngr method, the forecast variance is parameterized as a constant plus a term that is proportional to the ensemble variance of that forecast. The parameters of the ngr model are found by optimizing the continuous ranked probability score (CRPS), which requires minimizing the absolute forecast error (Gneiting et al. 2005). The constant variance parameters from MM-g are used to initialize the numerical optimization procedure used to minimize the CRPS.
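A rough sketch of the ngr fit (illustrative Python using SciPy; the parameterization follows the description above, but the variable names, the positivity guard, and the optimizer settings are ours — the study initializes from the MM-g constant-variance parameters):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def gaussian_crps(mu, sigma, obs):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)
    (Gneiting et al. 2005)."""
    z = (obs - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

def fit_ngr(ens_mean, ens_var, obs):
    """NGR: forecast mean a + b*xbar, forecast variance c + d*s^2, with
    parameters chosen to minimize the average CRPS over the hindcasts."""
    ens_mean, ens_var, obs = map(np.asarray, (ens_mean, ens_var, obs))
    def avg_crps(p):
        a, b, c, d = p
        var = np.maximum(c + d * ens_var, 1e-9)  # keep variance positive
        return np.mean(gaussian_crps(a + b * ens_mean, np.sqrt(var), obs))
    start = np.array([0.0, 1.0, np.var(obs - ens_mean), 0.0])
    return minimize(avg_crps, start, method='Nelder-Mead').x
```

When the fitted coefficient on the ensemble variance is near zero, ngr reduces to the constant-variance MM-g forecast.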

Regressing the ensemble mean of each model against observations and then taking the average of the separate regressions is a way of assigning different weights to each forecast model without directly comparing the models. This method is equivalent to multiple linear regression with the assumption that the models’ errors are uncorrelated. This procedure, unlike multiple regression, avoids assigning negative weights to models with positive skill. Probabilities are assigned using a Gaussian distribution to model the uncertainty of the averaged regressions. We call this method “grsep.”
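The grsep averaging of separate regressions might look like the following (illustrative Python, names ours; the probability step, which fits a Gaussian about the averaged prediction, is omitted):

```python
import numpy as np

def grsep_mean(model_means, obs):
    """Regress each model's ensemble mean on the observations
    separately, then average the individual regression predictions.
    Each model's own slope scales its influence, so no model is
    directly compared with another."""
    obs = np.asarray(obs, dtype=float)
    preds = []
    for x in model_means:                 # x: one model's hindcast means
        x = np.asarray(x, dtype=float)
        slope = np.cov(x, obs, bias=True)[0, 1] / np.var(x)
        intercept = obs.mean() - slope * x.mean()
        preds.append(intercept + slope * x)
    return np.mean(preds, axis=0)         # averaged regression forecast
```

Because each slope is proportional to that model's correlation with the observations, an unskillful model is automatically damped toward climatology rather than given a negative weight.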

The independent calibration methods above give an entire (Gaussian) forecast distribution from which the probabilities of exceeding any particular threshold can be computed. Two independent calibration methods that give only the categorical probabilities are “glro” and MM-bow. In glro, generalized regressions are developed separately between the ensemble means of each model and the binary variables for the observed occurrence of each category. Then the resulting probabilities are averaged. Similarly, in MM-bow weights for the ensemble probabilities of each model and the climatological probabilities are found to optimize the log-likelihood of the observations, which is proportional to the average ignorance (Rajagopalan et al. 2002). Then the resulting probabilities are averaged similar to Robertson et al. (2004).

### c. Joint calibration

The most familiar joint calibration method is “superensembling” where optimal weights are found for the ensemble means by multiple linear regression (Krishnamurti et al. 1999). Superensembling is a special case of the more general method of forecast assimilation (Stephenson et al. 2005). A forecast probability distribution can be computed using a Gaussian distribution to model the uncertainty of the superensemble mean. We call this method “gr.” Applying methods where the model weights are estimated simultaneously from historical data is potentially difficult in the case of the DEMETER data because the number of models is relatively large compared to the common history period. For instance, the robust estimation of regression coefficients may be difficult. The same potential difficulty applies to the Bayesian optimal weighting (bow) method where optimal weights for the model and climatological probabilities are simultaneously computed (Rajagopalan et al. 2002). In the case of Gaussian regression, the number of predictors can be reduced, and hence the number of parameters to be estimated, using canonical correlation analysis (CCA) or singular value decomposition (SVD; Yun et al. 2003). Here we use CCA with two modes (cca).

DelSole (2007) introduced a Bayesian regression framework where prior beliefs about the model weights can be used in the estimation of regression parameters. These methods can be used to estimate the forecast mean, and a Gaussian distribution can be used to model its uncertainty. The ridge regression with multimodel mean constraint (rrmm) method uses as its prior the belief that the multimodel mean is the best solution. This is the same as finding the coefficients that minimize the sum of squared error plus a penalty term that grows as the coefficients become different from 1/(number of models), which is 1/7 in our case. The weight given to the penalty term determines the character of the regression coefficients. When infinite weight is given to the penalty term, the model weights are all the same, and rrmm is identical to MM-g. When no weight is given to the penalty term, the model weights are those given by multiple regression, and rrmm is the same as gr, multiple regression. Another method, ridge regression with multimodel mean regression (rrmmr) uses as its prior the belief that the models should be given approximately the same weight and penalizes coefficients that are unequal but does not penalize their difference from 1/(number of models). Like rrmm, when infinite weight is given to the penalty term, rrmmr is the same as MM-g (the multimodel mean is essentially regressed with observations) and when no weight is given to the penalty term, rrmmr is the same as gr. For both the rrmm and rrmmr methods, the parameter determining the relative weight of the penalty term is computed using a second level of cross validation.
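The rrmm estimate can be sketched as a ridge regression that shrinks the weights toward 1/(number of models). This illustrative Python fragment uses the closed-form ridge solution; the penalty weight (here the argument `penalty`) would in practice be chosen by a second level of cross validation:

```python
import numpy as np

def rrmm_weights(X, y, penalty):
    """Minimize ||y - Xw||^2 + penalty * ||w - w0||^2, where w0 gives
    every model the same weight 1/M (the multimodel-mean prior).
    X: (years, models) matrix of ensemble means; y: observations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = X.shape[1]
    w0 = np.full(m, 1.0 / m)               # prior: equal weights
    A = X.T @ X + penalty * np.eye(m)
    b = X.T @ y + penalty * w0
    return np.linalg.solve(A, b)
```

The two limits described in the text are visible here: an infinite penalty forces w = w0 (equal weighting, as in MM-g), while a zero penalty recovers ordinary multiple regression (gr).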

## 5. Results

### a. Uncalibrated methods

A set of skill results arranged by start date and lead is shown in Fig. 2 for uncalibrated schemes, that is, ones that do not use any skill assessment calibration with respect to the observations. Average RPSS and compound average ROR values over all starts and leads are given in Table 1. Removing the mean bias from each model before pooling almost always improves skill, and often by substantial margins for both the RPSS and ROR skill measures (MM-bc versus MM). This result is consistent with Peng et al. (2002). Making the interensemble variances of each model identical (MM-vc) tends to slightly further increase RPSS and ROR skills, although the effect is not consistently positive. Two schemes that use only the ensemble means to construct the probability distribution (MM-c and MM-glr) have skill very close to that of MM-vc, but generally do not exceed it. We therefore consider the MM-vc as the benchmark among the ensembling methods to be tested below, as it represents a most general and basic calibration of the models’ individual biases in mean and interannual variability, while retaining the individual ensemble distributions with their year-to-year variations of spreads and shapes.

Figure 2 indicates that much of the useful forecast skill can be attributed to variations in the ensemble mean rather than variations in higher-order statistics, as MM-c and MM-glr skill tends to be only very slightly lower than MM-vc. This finding is consistent with the conclusions of other recent studies of other forecast variables (Kharin and Zwiers 2003; Hamill et al. 2004; Tippett et al. 2007; Tang et al. 2007). Examination of individual forecasts reveals that the slight shortcoming in MM-c for November starts is due mainly to the single year 1983 (shown in Fig. 3), when, initialized with cold conditions, all the forecasts gave highest probability to the cold category, but the observation was in the neutral category. The MM-vc forecast had slightly weaker probabilities for the cold category than did MM-c and so was penalized less.

The greatest benefit of the dynamical predictions compared to the conditional climatology frequency forecast occurs for forecasts starting in May. This is reasonable in view of the fact that May is the time of the northern spring ENSO predictability barrier—a time when warm or cold ENSO episodes are roughly equally likely to be dissipating (having matured several months earlier) versus growing toward a maturity to occur later that calendar year. The frequency forecast incorporates both possibilities indiscriminately from the entirety of the data history, while the dynamical models have the opportunity to respond to initial conditions that may help identify the direction of change in the ENSO condition (e.g., from subsurface tropical Pacific sea temperature structure). At the other extreme, the lead 1 forecasts starting in February have slightly less skill than the frequency forecasts as measured by RPSS. This could relate to the fact that it is a time of year when existing ENSO episodes are in a process of weakening toward neutral, a tendency that can be captured by the simple frequency forecasts.

For a perfectly reliable forecast, the expected RPSS depends on the ensemble size *N* according to a known relation (see appendix B for details); this dependence shows that the multimodel ensemble’s advantage over the average single-model skill is larger than would be expected from the increased ensemble size alone.

### b. Independent calibration

Skill measures arranged by start date and lead are given in Fig. 5 for the independent calibration methods. Average RPSS and compound ROR are given in Table 1. For the most part, the independent calibration methods have comparable skill with grsep having the best overall performance, with its ROR slightly exceeding that of MM-vc. However, the glro method has significantly poorer skill than the other methods. The poor skill obtained when using a generalized linear regression between the ensemble mean and the occurrence of a category is due to the method erroneously giving probabilities that are too strong. This behavior appears to be an effect of the small sample size and the fact that the predictands used to develop the model are 0 or 1. Moreover, in some cases the numerical method for estimating the glro parameters fails to converge. In contrast, the predictands in MM-glr method are the MM-bc probabilities which are not so extreme. Therefore the parameters of the generalized linear model are “easier” to fit, and the MM-glr tends to give milder probability shifts. Because the glro probabilities sometimes approach zero or one, the ROR scores them harshly when they are wrong. The MM-bow method suffers from similar problems and has somewhat poor skill compared to the other independent calibration methods.

### c. Joint calibration

We next explore the potential to improve skill scores further by joint calibration. Skill measures arranged by start date and lead are given in Fig. 6 for the joint calibration methods. Average RPSS and compound ROR are given in Table 1. The joint calibration methods have mostly comparable skills with only occasional improvement on that of MM-vc especially in the November starts. The methods gr and bow have poorer overall skill than the other methods, especially when measured by the ROR. The multiple regression method gr has substantially poorer skill outside of the November starts. The result that gr skill is poor when the skill level is low and is good when the skill level is high (November starts) is consistent with the usual estimate of the variance of the regression coefficients that depends on sample size and skill level. Apparently, the period of data (22 yr) is not long enough to fit the regression coefficients reliably when seven predictors are used. When forecasts are made on cases within the training sample (i.e., when cross validation is not used), results (not shown) using multiple linear regression are excellent, and usually exceed those of MM-vc. This indicates overfitting to the short sample data. Similar behavior was seen by Kang and Yoo (2006) who, using idealized climate models and deterministic skill scores, noted that the superensemble method leads to overfitting that was more severe when skill is low.

One way to avoid overfitting is to reduce the number of predictors and hence the number of parameters that need to be estimated. In the cca method we reduced the number of predictors by using two CCA modes as predictors. The cca method had the best overall performance of the jointly calibrated methods. Comparable performance was achieved by the Bayesian regression methods, which also are designed to avoid overfitting. The Bayesian regression methods penalize deviations from equal weighting and are less likely to overfit. The weight given to the penalty is chosen in a second level of cross validation, that is, the penalty parameter for a particular forecast is chosen using data that does not include that forecast. This parameter does not seem to be optimally estimated since, if it were, rrmm would always be better than MM-g (it is not) since MM-g is a special case of rrmm with infinite weight given to the penalty term. In fact, when the weight is determined using all the data, rrmm is better than MM-g. The Bayesian regression method that penalizes unequal weights, rrmmr, is slightly better overall than MM-g. It is consistently better than MM-g if the penalty weight is chosen using all the data. These problems suggest a fundamental problem with the rrmm and rrmmr methodologies since they are not able to use the data to determine when the differences between the models in the historical record are insufficient to weight one model more than another (DelSole 2007).

Thus, while skill results for some of the methods are relatively resistant to overfitting (grsep, rrmm, rrmmr, cca) and can be considered representative of expected skill on truly independent (e.g., future) data, a disappointment is that they are generally not superior to the skill of methods with equal weighting (MM-vc) despite apparent skill differences among the models (Fig. 4).

## 6. Summary and conclusions

This study assesses the skill of probabilistic ENSO forecasts and examines the benefits of ensemble MOS and multimodel combination methods using recent coupled model forecast histories of monthly Niño-3.4 tropical Pacific SST categories. Warm, neutral, and cold categories are defined using the upper, middle two, and lower quartile categories, respectively. Forecast data from the seven coupled models of the DEMETER project are used, with start dates spanning the period 1980–2001. Forecast skill is assessed using the ranked probability skill score and rate of return; ROR is a skill measure that attaches an investment value to the forecast information and has the same ranking as the ignorance or log-likelihood skill score (Roulston and Smith 2002).

Three classes of multimodel ensemble MOS methods for producing forecast categorical probabilities are examined: uncalibrated, independent calibration, and joint calibration. Uncalibrated methods only use model output, and do not use observations or model performance. Independent calibration methods use observations and model performance to calibrate each model separately. Joint calibration methods use each model’s performance as well as relations between the performance of models.

In uncalibrated methods, two model data preprocessing steps are performed: 1) removing the climatology of each model and 2) normalizing the anomalies to have the same variance. These calibrations to the individual models are shown to increase skills when multimodel ensembling with equal model weighting is used as a baseline condition. Such multimodel ensembles are found to have higher skills than those from the individual models given the same calibrations, with only occasional exceptions. The dependence of RPSS on ensemble size is known for a perfectly reliable model (Tippett et al. 2007). This dependence shows that the advantage in skill of the multimodel ensembles over the single-model average skill is larger than that expected because of the larger size of the multimodel ensemble. Other studies have directly shown that the skill improvement of the multimodel ensemble is greater than that achieved by increasing the ensemble size of a single model (Hagedorn et al. 2005). Therefore, the advantage of the multimodel ensemble over the small single-model ensemble is consistent with both increased ensemble size and reduced model error. Forecast distributions constructed from the multimodel ensemble mean alone had nearly as much skill as ones constructed from the forecast ensemble, indicating that much of the useful forecast information is contained in the mean of the forecast distribution. Interestingly, while ENSO predictive skill is generally low during boreal summer relative to other times of the year, the advantage of the model forecasts over forecasts based on historical frequencies is greatest at this time.

Overall, independent calibration performed better than joint calibration for both the RPSS and for the ROR score unless steps were taken to avoid overfitting. The reason for this seems to be the shortness of the historical record (i.e., 22 yr) relative to the number of models (i.e., seven). Joint calibration estimates seven parameters simultaneously from the data while independent calibration estimates a single parameter at a time from the data. Even in the case of univariate linear regression, skill can be degraded when the sample size is small (Tippett et al. 2005). Reducing the number of predictors or incorporating prior information about the weights were effective ways to prevent overfitting. On the other hand, joint calibration methods that did not restrict the weights had poorer skill than independent methods.

The conclusion and recommendation are that little or nothing is gained, and something may be lost, by using the most general joint calibration schemes for seven models contributing to multimodel ensembles based on 20–25 yr of model history. Rather, independent calibration, as well as joint calibration with measures to prevent overfitting, can be used without sacrificing skill when forecasting outside of the training sample. It is assumed that when one or more models have skill clearly out of line with the others in general and obvious ways, they should simply be removed from the ensembling exercise. However, unless the difference in skill is sufficiently large, removing the "worse" model may have the unintended consequence of degrading skill (Kharin and Zwiers 2002; Kug et al. 2007). While some of the models used in this study appeared to have somewhat higher skill than others, none had unusually low skill.

## Acknowledgments

The authors thank Simon Mason, Andreas Weigel, and two anonymous reviewers for their comments and suggestions. The authors are supported by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (Grant NA05OAR4311004). The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies.

## REFERENCES

Barnston, A. G., M. Chelliah, and S. B. Goldenberg, 1997: Documentation of a highly ENSO-related SST region in the equatorial Pacific. *Atmos.–Ocean*, **35**, 367–383.

DelSole, T., 2004: Predictability and information theory. Part I: Measures of predictability. *J. Atmos. Sci.*, **61**, 2425–2440.

DelSole, T., 2007: A Bayesian framework for multimodel regression. *J. Climate*, **20**, 2810–2826.

DelSole, T., and M. K. Tippett, 2007: Predictability: Recent insights from information theory. *Rev. Geophys.*, **45**, RG4002, doi:10.1029/2006RG000202.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Glahn, H. R., and D. A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. *J. Appl. Meteor.*, **11**, 1203–1211.

Glantz, M. H., 2001: *Currents of Change: Impacts of El Niño and La Niña on Climate and Society*. 2nd ed. Cambridge University Press, 266 pp.

Gneiting, T., A. Raftery, A. Westveld, and T. Goldman, 2005: Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. *Mon. Wea. Rev.*, **133**, 1098–1118.

Good, I. J., 1952: Rational decisions. *J. Roy. Stat. Soc., Series B*, **14**, 107–114.

Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting—I. Basic concept. *Tellus A*, **57**, 219–233, doi:10.1111/j.1600-0870.2005.00103.x.

Hamill, T. M., J. S. Whitaker, and X. Wei, 2004: Ensemble reforecasting: Improving medium-range forecast skill using retrospective forecasts. *Mon. Wea. Rev.*, **132**, 1434–1447.

Kang, I.-S., and J. Yoo, 2006: Examination of multi-model ensemble seasonal prediction methods using a simple climate system. *Climate Dyn.*, **26**, 285–294.

Kaplan, A., M. A. Cane, Y. Kushnir, A. C. Clement, M. B. Blumenthal, and B. Rajagopalan, 1998: Analyses of global sea surface temperature 1856–1991. *J. Geophys. Res.*, **103** (C9), 18567–18589.

Kharin, V. V., and F. W. Zwiers, 2002: Climate predictions with multimodel ensembles. *J. Climate*, **15**, 793–799.

Kharin, V. V., and F. W. Zwiers, 2003: Improved seasonal probability forecasts. *J. Climate*, **16**, 1684–1701.

Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. *J. Atmos. Sci.*, **59**, 2057–2072.

Kousky, V. E., and R. W. Higgins, 2007: An alert classification system for monitoring and assessing the ENSO cycle. *Wea. Forecasting*, **22**, 353–371.

Krishnamurti, T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from a multi-model superensemble. *Science*, **286**, 1548–1550.

Kug, J.-S., J. Lee, and I. Kang, 2007: Global sea surface temperature prediction using a multi-model ensemble. *Mon. Wea. Rev.*, **135**, 3239–3247.

Mason, S. J., and L. Goddard, 2001: Probabilistic precipitation anomalies associated with ENSO. *Bull. Amer. Meteor. Soc.*, **82**, 619–638.

Mason, S. J., and G. M. Mimmack, 2002: Comparison of some statistical methods of probabilistic forecasting of ENSO. *J. Climate*, **15**, 8–29.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Palmer, T., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Peng, P., A. Kumar, H. van den Dool, and A. G. Barnston, 2002: An analysis of multimodel ensemble predictions for seasonal climate anomalies. *J. Geophys. Res.*, **107**, 4710, doi:10.1029/2002JD002712.

Rajagopalan, B., U. Lall, and S. E. Zebiak, 2002: Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. *Mon. Wea. Rev.*, **130**, 1792–1811.

Reynolds, R. W., N. A. Rayner, T. M. Smith, D. C. Stokes, and W. Wang, 2002: An improved in situ and satellite SST analysis for climate. *J. Climate*, **15**, 1609–1625.

Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. *Quart. J. Roy. Meteor. Soc.*, **127**, 2473–2489.

Robertson, A. W., U. Lall, S. E. Zebiak, and L. Goddard, 2004: Improved combination of multiple atmospheric GCM ensembles for seasonal prediction. *Mon. Wea. Rev.*, **132**, 2732–2744.

Ropelewski, C. F., and M. S. Halpert, 1996: Quantifying Southern Oscillation–precipitation relationships. *J. Climate*, **9**, 1043–1059.

Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. *Mon. Wea. Rev.*, **130**, 1653–1660.

Shukla, J., and Coauthors, 2000: Dynamical seasonal prediction. *Bull. Amer. Meteor. Soc.*, **81**, 2593–2606.

Stephenson, D. B., C. A. S. Coelho, F. J. Doblas-Reyes, and M. Balmaseda, 2005: Forecast assimilation: A unified framework for the combination of multi-model weather and climate predictions. *Tellus A*, **57**, 253–264.

Tang, Y., H. Lin, J. Derome, and M. K. Tippett, 2007: A predictability measure applied to seasonal predictions of the Arctic Oscillation. *J. Climate*, **20**, 4733–4750.

Tippett, M. K., R. Kleeman, and Y. Tang, 2004: Measuring the potential utility of seasonal climate predictions. *Geophys. Res. Lett.*, **31**, L22201, doi:10.1029/2004GL021575.

Tippett, M. K., A. G. Barnston, D. G. DeWitt, and R.-H. Zhang, 2005: Statistical correction of tropical Pacific sea surface temperature forecasts. *J. Climate*, **18**, 5141–5162.

Tippett, M. K., A. G. Barnston, and A. W. Robertson, 2007: Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. *J. Climate*, **20**, 2210–2228.

van Oldenborgh, G. J., M. Balmaseda, L. Ferranti, T. Stockdale, and D. Anderson, 2005: Did the ECMWF seasonal forecast model outperform statistical ENSO forecast models over the last 15 years? *J. Climate*, **18**, 3240–3249.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **135**, 118–124.

Wilks, D. S., 2006: Comparison of ensemble-MOS methods in the Lorenz '96 setting. *Meteor. Appl.*, **13**, 243–256.

Wilks, D. S., and T. M. Hamill, 2007: Comparison of ensemble-MOS methods using GFS reforecasts. *Mon. Wea. Rev.*, **135**, 2379–2390.

Yun, W. T., L. Stefanova, and T. N. Krishnamurti, 2003: Improvement of the multimodel superensemble technique for seasonal forecasts. *J. Climate*, **16**, 3834–3840.

## APPENDIX A

### A Gaussian Model for Forecast Probabilities

Forecast probabilities are computed from a Gaussian forecast distribution based on the correlation $r$ between the mean forecast and observations. The correlation corresponds to a signal-to-noise ratio $S$ given by

$$S = \frac{r^2}{1 - r^2},$$

which relates the signal variance $\sigma^2_S$ to the noise variance $\sigma^2_N$. Therefore, the noise variance is given by $\sigma^2_N = \sigma^2_S / S$, where we use the variance of the mean forecasts as the signal variance. This gives a forecast distribution whose mean is the mean forecast and whose variance is $\sigma^2_N$. Category boundaries are computed using the climatological distribution, whose mean is the long-term average of the mean forecast and whose variance is the sum of the signal and noise variances.
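The recipe above can be sketched as follows for tercile categories; the function and its arguments are illustrative, with the standard normal tercile $z \approx 0.4307$ defining the category boundaries:

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def category_probs(mean_forecast, signal_var, clim_mean, r):
    """Tercile probabilities from the Gaussian forecast model.

    signal_var: variance of the mean forecasts (signal variance).
    r:          correlation of the mean forecast with observations.
    """
    S = r * r / (1.0 - r * r)            # signal-to-noise ratio
    noise_var = signal_var / S           # sigma_N^2 = sigma_S^2 / S
    clim_sd = math.sqrt(signal_var + noise_var)   # climatological spread
    z = 0.4307272993                     # standard normal tercile
    lo, hi = clim_mean - z * clim_sd, clim_mean + z * clim_sd
    sd = math.sqrt(noise_var)            # forecast-distribution spread
    p_below = norm_cdf(lo, mean_forecast, sd)
    p_above = 1.0 - norm_cdf(hi, mean_forecast, sd)
    return p_below, 1.0 - p_below - p_above, p_above
```

For example, a strong warm mean forecast with $r = 0.8$ concentrates probability in the above-normal category, while a forecast at the climatological mean, or one with low correlation, returns probabilities near the climatological 1/3 each.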

## APPENDIX B

### Dependence of RPSS on Ensemble Size

Here we compute how the expected RPS of a reliable forecast depends on the ensemble size $N$. The RPS of an $m$-category forecast is

$$\mathrm{RPS} = \sum_{i=1}^{m}\left(\sum_{j=1}^{i} P_j - \sum_{j=1}^{i} O_j\right)^{2},$$

where $P_i$ is the probability assigned to the $i$th category and $O_i$ is 1 when the observation falls into the $i$th category and 0 otherwise. Assuming that the forecast is reliable means that the forecast probability is indeed the probability that an observation will fall into the $i$th category, so that $\langle O_i\rangle = P_i$ and $\langle O_i O_j\rangle = P_i\,\delta_{ij}$, where $\delta_{ij}$ is defined to be 1 when $i = j$ and 0 otherwise. This assumption of reliability allows us to compute the expected value of the RPS as

$$\langle\mathrm{RPS}\rangle = \sum_{i=1}^{m}\sum_{j=1}^{i}\sum_{k=1}^{i}\left(P_j\,\delta_{jk} - P_j P_k\right).$$

Direct manipulation of this expression gives

$$\langle\mathrm{RPS}\rangle = \sum_{i=1}^{m} C_i\left(1 - C_i\right), \qquad C_i \equiv \sum_{j=1}^{i} P_j.$$

For an $N$-member ensemble, the forecast probability of the $i$th category is $P_i + \epsilon_i$, where $\epsilon_i$ is the error in the forecast of the $i$th category due to sampling variability. The expected RPS of the $N$-member ensemble is

$$\langle\mathrm{RPS}(N)\rangle = \langle\mathrm{RPS}\rangle + \sum_{i=1}^{m}\left\langle\left(\sum_{j=1}^{i}\epsilon_j\right)^{2}\right\rangle,$$

since $\langle\epsilon_i\rangle = 0$ and only quadratic error terms appear in $\langle\mathrm{RPS}(N)\rangle$. In particular, using the multinomial sampling covariance $\langle\epsilon_i\epsilon_j\rangle = \left(P_i\,\delta_{ij} - P_i P_j\right)/N$, a direct calculation gives

$$\langle\mathrm{RPS}(N)\rangle = \left(1 + \frac{1}{N}\right)\langle\mathrm{RPS}\rangle. \tag{B8}$$

The result in (B8) can be interpreted as expressing the fact that the RPS of an *N*-member ensemble is the sum of the sampling variability of the observation and the ensemble. Equation (B8) can be used, assuming reliability, to estimate the infinite-ensemble skill given the finite-ensemble skill, similar to the debiased RPSS of Weigel et al. (2007). Although Weigel et al. (2007) use a different set of assumptions, their debiased RPSS agrees to leading order with the result here.
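The factor $1 + 1/N$ in (B8) can be checked with a small Monte Carlo experiment; the three-category probabilities and ensemble size below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([0.5, 0.3, 0.2])       # a reliable 3-category forecast
C = np.cumsum(P)                    # cumulative probabilities C_i

rps_inf = np.sum(C * (1.0 - C))     # infinite-ensemble <RPS>
N = 5
rps_b8 = (1.0 + 1.0 / N) * rps_inf  # <RPS(N)> predicted by (B8)

# Monte Carlo: draw the N-member ensemble and the verifying
# observation from the same distribution (reliability), then
# average the RPS of the ensemble relative frequencies.
trials = 200_000
members = rng.multinomial(N, P, size=trials) / N   # ensemble frequencies
obs = rng.multinomial(1, P, size=trials)           # one-hot observations
rps_mc = np.mean(
    np.sum((np.cumsum(members, axis=1) - np.cumsum(obs, axis=1)) ** 2, axis=1)
)
```

Here `rps_mc` agrees with `rps_b8` to within Monte Carlo error and exceeds the infinite-ensemble value `rps_inf`, reflecting the sampling-variability penalty paid by a finite ensemble.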

Table: List of the methods and their key properties, with compound averaged ROR and average RPSS computed from all starts and leads. Methods in each class are listed in order of compound averaged ROR; maximum values for each class of method are shown in boldface.

^{1} In a Gaussian perfect model setting with an ensemble mean forecast whose correlation with observations is *r*, the mutual information is −log(1 − *r*^{2}).