## 1. Introduction

In recent years, probabilistic ensemble forecast systems have been established in a wide range of applications. The probabilistic nature of these forecasts requires verification techniques based on probabilistic skill measures. However, there is no general agreement on the best skill score; the choice depends on the particular application considered or the forecast system being used. Examples are Brier scores (BSs) or the relative operating characteristic (for details see Swets 1973; Mason 1982; Jolliffe and Stephenson 2003; Wilks 1995). The Brier score, for instance, is essentially the mean squared error (mse) of the probability forecast of a dichotomous event. An example is a probability forecast of the winter mean temperature being above or below the climatological mean (Palmer et al. 2000). For a range of applications, such a dichotomous score gives an incomplete picture, since the entire shape of the probability function is not considered. A multicategory score that measures the shape as well as the central tendency of the whole probability density function (PDF) is more suitable. An often used score for such applications is the ranked probability score (RPS; Epstein 1969; Murphy 1969; Murphy 1971).

The RPS is based on the cumulative distribution function (CDF) and is classically defined with the quadratic norm, hereafter denoted as RPS_{L=2}. The score is the integrated squared difference between the forecast and the observed CDF. It can be seen as the probabilistic extension of the mse. However, the RPS_{L=2} is applied in the cumulative probability space and not in the physical space; that is, the integration is taken over categories. It can be interpreted as an extension of the Brier score to finite ordered categories. The extension to an infinite number of classes results in the continuous RPS (Unger 1985; Hersbach 2000).

Current ensemble prediction systems (EPSs) for medium-range forecasts (3 to 10 days) typically use ensemble sizes varying from 17 [National Centers for Environmental Prediction (NCEP)] to 50 [European Centre for Medium-Range Weather Forecasts (ECMWF)] members to construct the probability density function (Toth and Kalnay 1993; Tracton and Kalnay 1993; Buizza et al. 1998). For long-range forecasts, with prediction times of months to years, the ensemble size is usually smaller. The reason for having small ensemble sizes lies primarily in computational costs, which is particularly true for hindcast experiments that are used for verification and calibration. For example, the 15-yr hindcast database of the ECMWF Seasonal Forecast System 2 mostly consists of five ensemble members.

Such a small ensemble size (and number of forecasts) can lead to statistical problems stemming from large sampling errors. In the framework of numerical weather prediction, Buizza and Palmer (1998) have shown that the forecast skill can be improved by increasing the ensemble size, but the extent of the improvement depends on the measure used. For the RPSS_{L=2}, they found substantial skill improvement up to at least eight members. In a perfect-model seasonal forecast approach, Déqué (1997) determined the ensemble size required for skill score saturation for various parameters; an ensemble size of 40 was suggested for European temperature forecasts. Kumar et al. (2001) also explored the influence of the ensemble size on the RPSS_{L=2} and noted that the RPSS_{L=2} is strongly negatively biased for small ensemble sizes. Based on this bias, they derived the minimum ensemble size required to predict a climate signal with a known signal-to-noise ratio (mean shift of the anomaly distribution in standardized units).

Here we show that the substantial negative bias of the RPSS_{L=2} for small ensemble sizes is primarily a consequence of the discretization and of the squaring measure in its formulation. Two strategies are introduced that overcome these deficiencies (section 2). The characteristics of the suggested modifications of the RPSS_{L=2} are examined with a synthetic example (section 3). The new techniques are then applied to a real seasonal winter temperature forecast based on the ECMWF Seasonal Forecast System 2 for the years 1987–2001 (section 4) and to the climate signal-to-noise detection problem (section 5). Finally, conclusions and a discussion are given in the last section.

## 2. Definition of RPSS_{L}

The RPS_{L=2} is a quadratic measure, and thus larger deviations from the actual probability are penalized much more strongly than smaller ones. The RPS_L for any norm L is defined as

$$\mathrm{RPS}_L = \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{J} \left| Y_{j,k} - O_{j,k} \right|^{L} , \tag{1}$$

where N is the number of forecasts, k is the forecast index, J denotes the number of classes, and L is the norm. The cumulative probabilities of the forecasts Y_j and the observations O_j are defined as Y_j = Σ_{i=1}^{j} y_i and O_j = Σ_{i=1}^{j} o_i, where y_i and o_i are the probabilities of the forecast and the observation, respectively, for class i. The RPS_L is zero in the case of a perfect forecast and positive otherwise. The calculation of the skill score is based on the comparison of the forecast score (RPS_{L,FC}) with a reference forecast score (RPS_{L,CL}); the RPSS_L becomes

$$\mathrm{RPSS}_L = 1 - \frac{\mathrm{RPS}_{L,\mathrm{FC}}}{\mathrm{RPS}_{L,\mathrm{CL}}} . \tag{2}$$

A positive RPSS_L indicates a forecast benefit compared to the reference forecast. For L = 2 the skill score becomes the standard squared definition (Wilks 1995), whereas L = 1 gives the absolute skill score RPSS_{L=1}. In this study, the distribution of the reference forecast is defined by a Gaussian fit to the observations. The equiprobable classes are obtained from the distribution of the reference forecast.
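
As an illustration of these definitions, the following sketch computes the RPS_L and the RPSS_L from raw ensemble values. It is a minimal implementation, not the authors' code; the function names and the tercile class edges are illustrative choices.

```python
import numpy as np

def rps(ens, obs, edges, L=2):
    """RPS_L of a single forecast: `ens` holds the ensemble values, `obs`
    the verifying observation, `edges` the J-1 inner class boundaries."""
    bins = np.concatenate(([-np.inf], edges, [np.inf]))
    y = np.histogram(ens, bins=bins)[0] / len(ens)   # forecast probabilities
    o = np.histogram([obs], bins=bins)[0].astype(float)
    Y, O = np.cumsum(y), np.cumsum(o)                # cumulative probabilities
    return np.sum(np.abs(Y - O) ** L)

def rpss(ens_fc, obs, edges, ref_probs, L=2):
    """RPSS_L over N forecasts against a reference forecast with fixed
    class probabilities `ref_probs` (e.g., equiprobable climatology)."""
    bins = np.concatenate(([-np.inf], edges, [np.inf]))
    Y_ref = np.cumsum(ref_probs)
    fc = cl = 0.0
    for ens, ob in zip(ens_fc, obs):
        O = np.cumsum(np.histogram([ob], bins=bins)[0].astype(float))
        fc += rps(ens, ob, edges, L)
        cl += np.sum(np.abs(Y_ref - O) ** L)
    return 1.0 - fc / cl
```

A perfect forecast (all members in the observed class) scores RPS_L = 0, so RPSS_L = 1 whenever the forecast score vanishes.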

## 3. A synthetic example

In the following, a white noise climate system is employed in order to explore the sensitivity of the RPSS_{L=2} to the ensemble size. Forecasts and observations are chosen to be random Gaussian time series. In such a system, the expected outcome is a skill score indicating no benefit compared to a reference forecast. The white noise climate consists of a sample of 100 000 cases. For each skill score, 300 ensembles are randomly chosen and verified against randomly chosen observations; this gives a robust estimate of the skill score. This procedure is repeated 100 times for randomly chosen observations, giving 100 skill scores from which the mean and the 95% confidence intervals are calculated. These confidence intervals serve as guidance for the subsequent local testing of the real seasonal forecasts against random time series. Other significance tests (e.g., field significance) are available; a more complete discussion is found in Wilks (1995) and Nicholls (2001). In the light of a potential application to the ECMWF forecast system, the skill scores are calculated with up to 40 ensemble members and 15 forecasts (corresponding to a 15-yr hindcast). Three equiprobable classes (above, normal, and below) are used.

Figure 1a shows the dependence of the RPSS_{L=2} on the ensemble size for white noise climate forecasts. The mean of the RPSS_{L=2} exhibits negative skill ranging from −0.20 up to −0.02 for ensemble sizes of 5 and 40, respectively. These values are below the expected value of zero benefit. With larger ensemble size, the bias decreases slowly toward zero. The 95% confidence intervals (thin lines) are asymmetric and vary from about −0.45/0.05 for a 5-member system to about −0.10/0.05 for a size of 40. The confidence intervals as a function of the number of equiprobable classes are shown in Fig. 2a. The confidence intervals of a 40-member forecast system (black) are much closer to the mean values than those of a 5-member system (gray). A strong asymmetry around zero is visible for small ensemble sizes. The magnitudes of the confidence intervals are largest for two classes. Finally, the bias and confidence intervals are evaluated as a function of the number of forecasts (Fig. 3). The confidence intervals are closer to the mean for a higher number of forecasts. However, the bias remains unchanged, indicating that it is related to the ensemble size.

The corresponding results for the RPSS_{L=1} are shown in Figs. 1b and 2b. The mean of the RPSS_{L=1} (Fig. 1b) is located close to zero even for the smallest ensemble sizes, for example, for a five-member system. The 95% confidence intervals are spread symmetrically around the zero mean with a magnitude of about ±0.14. For a large ensemble size (40) the spread is about ±0.05. The confidence intervals are symmetric around zero for all classes (Fig. 2b) and depend only on the ensemble size, with reduced spread for larger samples. Evidently, the RPSS_{L=1} has no bias for small ensemble sizes.
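
The white noise experiment can be reproduced with a small Monte Carlo sketch. This is a simplified version of the setup above: the reference score uses the exact tercile climatology rather than a fitted Gaussian, and the sample sizes are reduced for speed.

```python
import numpy as np

rng = np.random.default_rng(42)
EDGES = np.array([-0.4307, 0.4307])            # terciles of N(0, 1)

def mean_rpss(m, n=4000, L=2):
    """Mean RPSS_L of m-member white-noise Gaussian forecasts verified
    against random observations, using three equiprobable classes; the
    reference forecast uses the exact climatological probabilities."""
    Y_cl = np.array([1/3, 2/3, 1.0])           # cumulative climatology
    fc = cl = 0.0
    for _ in range(n):
        ens = rng.standard_normal(m)
        obs = rng.standard_normal()
        y = np.bincount(np.searchsorted(EDGES, ens), minlength=3) / m
        O = np.cumsum(np.bincount([np.searchsorted(EDGES, obs)], minlength=3))
        fc += np.sum(np.abs(np.cumsum(y) - O) ** L)
        cl += np.sum(np.abs(Y_cl - O) ** L)
    return 1.0 - fc / cl
```

With five members the quadratic score comes out near −0.2, consistent with Fig. 1a, while the absolute (L = 1) score fluctuates around zero.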

The origin of the bias can be illustrated by comparing the RPS_{L=2} and the RPS_{L=1} for a white noise climate forecast. Suppose a forecast for two equiprobable classes (in this case, the RPS reduces to the BS). Since the cumulative probabilities for the forecast and the observation in the second class are always both equal to one, the RPS_{L=2,k} value reduces to

$$\mathrm{RPS}_{L=2,k} = \left( Y_1 - O_1 \right)^2 , \tag{3}$$

and the RPS_{L=1,k} reduces to

$$\mathrm{RPS}_{L=1,k} = \left| Y_1 - O_1 \right| . \tag{4}$$

For the climatological reference forecast, Y_1 = 1/2 and O_1 is either 0 or 1 with equal probability, so the expected RPS_{L=2,CL} is 1/4, whereas the RPS_{L=1,CL} is 1/2.

For an ensemble forecast, the cumulative probability Y_1 can only take on a set of discrete values. These probabilities are (0, 1/m, 2/m, . . . , (m − 1)/m, 1) for an ensemble system with m members. For a three-member ensemble system these values are explicitly (0, 1/3, 2/3, 1). The mean RPS_{L=2,FC} of a white noise forecast is then (1/8)·1 + (3/8)(2/3)² + (3/8)(1/3)² + (1/8)·0 = 1/3, where the forecast probabilities are weighted by the relative frequency of their occurrence. Evidently, the squaring of the cumulative measures gives an RPS_{L=2,FC} that is larger than the expected reference value RPS_{L=2,CL} of 1/4. As a consequence, the RPSS_{L=2} takes a negative value. In a general formulation of the bias of a white noise climate forecast, each possible cumulative probability value is weighted by its probability of occurrence, which is given by the binomial distribution. Thus, for a two-class system the scores of the unskilled ensemble forecast are given by

$$\mathrm{RPS}_{L=2,\mathrm{FC}} = \sum_{i=0}^{m} \binom{m}{i} \left(\frac{1}{2}\right)^{m} \frac{1}{2} \left[ \left(\frac{i}{m}\right)^{2} + \left(1 - \frac{i}{m}\right)^{2} \right] \tag{5}$$

and

$$\mathrm{RPS}_{L=1,\mathrm{FC}} = \sum_{i=0}^{m} \binom{m}{i} \left(\frac{1}{2}\right)^{m} \frac{1}{2} \left[ \frac{i}{m} + \left(1 - \frac{i}{m}\right) \right] = \frac{1}{2} . \tag{6}$$

For L = 2, the RPS_{L=2,FC} is therefore always larger than RPS_{L=2,CL}. Figure 4 shows the bias of the RPSS_L for different ensemble sizes and norms. The analytical bias is almost equal to the bias in the white noise climate forecast (Fig. 1a). Finally, note that for more than two classes the relative frequency of occurrence does not increase linearly and Eqs. (5) and (6) increase in complexity. However, the bias also remains for forecast systems whose ensemble size is an integer multiple of the number of categories (not shown).

For L = 1, the RPS_{L=1,FC} of a three-member ensemble system gives (1/8)·1 + (3/8)(2/3) + (3/8)(1/3) + (1/8)·0 = 1/2. The outcome of RPS_{L=1,FC} is exactly equal to the expected reference value RPS_{L=1,CL} of 1/2, and the RPSS_{L=1} is equal to zero. The analytical bias (Fig. 4) is zero, as expected. For higher orders of the norm, the bias is even stronger than in the quadratic case. The example illustrates that the negative bias in the standard definition of the RPS_{L=2} results from sampling errors in the forecast probabilities. It is primarily a consequence of the squared measure used to quantify the forecast error in the cumulative probability space. This discretization-squaring error also occurs for large ensemble sizes, but there the negative bias is relatively small. However, for systems with small ensemble sizes, the bias can reach values comparable to the skill of the system, and the score becomes meaningless.
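
The binomial weighting argument can be written down directly. The helper below is a sketch of Eqs. (5) and (6) for the two-class case (the function name is ours):

```python
from math import comb

def expected_rpss(m, L):
    """Expected RPSS_L of an unskilled m-member forecast of two
    equiprobable classes: Y_1 takes the values i/m with binomial weights,
    and the observation falls in either class with probability 1/2."""
    rps_fc = sum(
        comb(m, i) * 0.5 ** m * 0.5 * ((i / m) ** L + (1 - i / m) ** L)
        for i in range(m + 1)
    )
    rps_cl = 0.5 ** L                          # reference forecast: Y_1 = 1/2
    return 1.0 - rps_fc / rps_cl
```

For L = 2 this evaluates to exactly −1/m (for three members, 1 − (1/3)/(1/4) = −1/3, matching the worked example above), while for L = 1 the bias vanishes for any m.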

Although the RPSS_{L=1} is a skill score that can be used for systems with small ensemble size, it is handicapped since it is not strictly proper (i.e., the forecast probabilities can be hedged toward values that are likely to give higher or equal scores). Strictly proper scores discourage forecasters from hedging their forecast probabilities toward probabilities that are likely to score higher (Jolliffe and Stephenson 2003). To illustrate that the RPS_{L=1} is not strictly proper, we calculate the expected score a forecast y_i would receive and find which forecast yields the best score. Let f_i be the probability the forecaster believes to be correct and F_j the corresponding cumulative probability of the event being forecast, defined as F_j = Σ_{i=1}^{j} f_i. The expected RPS_L becomes (e.g., see Wilks 1995)

$$E\left(\mathrm{RPS}_L\right) = \sum_{j=1}^{J} \left[ F_j \left(1 - Y_j\right)^{L} + \left(1 - F_j\right) Y_j^{L} \right] . \tag{7}$$

A strictly proper score is minimized only for y_i = f_i. To find the value of Y_j that minimizes E(RPS_L), we take the partial derivative with respect to Y_j (assuming the true probabilities f_i to be constant), which gives

$$\frac{\partial E\left(\mathrm{RPS}_L\right)}{\partial Y_j} = -L F_j \left(1 - Y_j\right)^{L-1} + L \left(1 - F_j\right) Y_j^{L-1} . \tag{8}$$

For L = 1, Eq. (8) reduces to

$$\frac{\partial E\left(\mathrm{RPS}_{L=1}\right)}{\partial Y_j} = 1 - 2F_j , \tag{9}$$

which is independent of Y_j, so the expected RPS_{L=1} cannot be minimized by the choice Y_j = F_j; the best expected score is instead obtained by hedging Y_j toward 0 or 1, depending on the sign of 1 − 2F_j. Thus the RPS_{L=1} and hence the RPSS_{L=1} are not strictly proper. For L = 2, setting the derivative to zero gives

$$\frac{\partial E\left(\mathrm{RPS}_{L=2}\right)}{\partial Y_j} = 2\left(Y_j - F_j\right) = 0 \quad \Longrightarrow \quad Y_j = F_j , \tag{10}$$

that is, y_i = f_i, and hence the RPS_{L=2} is strictly proper. For two classes, the RPS_{L=2} is a special case of the Brier score and has been shown to be strictly proper (e.g., see Mason 2004).
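
The hedging argument can be checked numerically. The sketch below evaluates the expected two-class score on a grid of issued probabilities when the forecaster's true probability is F₁ = 0.6 (an arbitrary illustrative value):

```python
import numpy as np

def expected_rps(Y1, F1, L):
    """Expected two-class RPS_L when issuing cumulative probability Y1
    while the true cumulative probability of the event is F1."""
    return F1 * abs(1.0 - Y1) ** L + (1.0 - F1) * abs(Y1) ** L

F1 = 0.6
grid = np.linspace(0.0, 1.0, 101)          # candidate issued probabilities
best_L1 = grid[np.argmin([expected_rps(y, F1, 1) for y in grid])]
best_L2 = grid[np.argmin([expected_rps(y, F1, 2) for y in grid])]
# L = 1: the optimum is a hedged, categorical forecast (Y1 = 1), not F1.
# L = 2: the optimum is the honest forecast Y1 = F1, i.e., strictly proper.
```

The slope 1 − 2F₁ is negative here, so the L = 1 score rewards issuing certainty, exactly as the derivative argument predicts.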

The second strategy retains the quadratic norm but removes the bias by using a reference forecast with the same intrinsic sampling uncertainty as the ensemble forecast. The climatological reference forecast is repeatedly resampled with ensembles of the same size m as the forecast ensemble, and the debiased skill score RPSS_D becomes

$$\mathrm{RPSS}_D = 1 - \frac{\mathrm{RPS}_{L=2,\mathrm{FC}}}{\frac{1}{q} \sum_{l=1}^{q} \mathrm{RPS}_{L=2,\mathrm{CL}_l}} , \tag{11}$$

where RPS_{L=2,CL_l} is the score of the lth resampled reference forecast and q is the number of discrete resamples of the reference forecasts. To ensure that the climatology is fully represented, q must be chosen large enough. Since the reference forecast now consists of an ensemble of the same size as the forecast ensemble, the possible probabilities and the relative frequencies of occurrence take on the same discrete values as for the forecast system. Alternatively, Eq. (11) can be derived using binomial errors for the reference forecast, which would reduce computation and eliminate sampling variability resulting from the sampling strategy. This newly defined skill score (RPSS_D) is zero for white noise forecasts for any ensemble size and number of classes (Figs. 1c and 2c). Furthermore, the RPSS_D is based on strictly proper scores. Therefore, the RPSS_D provides an adequate strategy to compare ensemble systems with small ensemble sizes.
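
A minimal sketch of this resampling strategy follows (the helper names are ours; the class edges are N(0, 1) terciles, and the resample count q is kept small for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
EDGES = np.array([-0.4307, 0.4307])            # terciles of N(0, 1)

def rps2(ens, obs):
    """Quadratic RPS of one ensemble forecast against one observation."""
    y = np.bincount(np.searchsorted(EDGES, ens), minlength=3) / len(ens)
    o = np.bincount([np.searchsorted(EDGES, obs)], minlength=3)
    return np.sum((np.cumsum(y) - np.cumsum(o)) ** 2)

def rpss_d(fc, obs, clim, q=100):
    """Debiased RPSS: the reference score averages the RPS of q
    climatological ensembles of the same size m as the forecast."""
    m = fc.shape[1]
    fc_score = np.mean([rps2(e, o) for e, o in zip(fc, obs)])
    ref_score = np.mean([rps2(rng.choice(clim, size=m), o)
                         for o in obs for _ in range(q)])
    return 1.0 - fc_score / ref_score
```

For white noise forecasts this score fluctuates around zero even with five members, where the standard RPSS_{L=2} is biased toward roughly −0.2.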

## 4. Application to seasonal forecasts

In order to assess the benefit of the debiased RPSS_D in a real application, the skill scores of the ECMWF Seasonal Forecast System 2 are calculated. This system is an operational, fully coupled atmosphere–ocean GCM and is described in detail by Anderson et al. (2003). The hindcast data analyzed here consist of 1-month lead forecasts (months 2 to 4) of the winter mean [December–February (DJF)] 2-m temperature (T2). A set of 40 ensemble members is available for each year of the hindcast period 1987–2001. For further analysis, the forecasts are postprocessed by removing a lead-time-dependent mean model drift based on 15 years of hindcast climatology. The reference forecast is based on a Gaussian fit to the 40-yr ECMWF Re-Analysis (ERA-40) for the same period 1987–2001. This dataset is also used to define the edges of the three probability classes used. For the forecast and reference forecast, no cross validation is applied.

In Fig. 5a the gridpoint-based RPSS_{L=2} is shown for the full set of 40 ensemble members for T2. The overall picture of the RPSS_{L=2} is dominated by alternating patterns of strong positive and negative skill scores. Negative skill scores are found in large areas over Europe, and positive skill scores over the northern Atlantic Ocean, in the northern part of Scandinavia, and at the east coast of North America. In Fig. 5b the skill scores are shown for the RPSS_{L=1}. The figure shows the same regions of positive skill score over the Atlantic and northern Scandinavia as the RPSS_{L=2}. However, over the continents the regions of localized negative skill score disappear. The RPSS_D (Fig. 5c) shows results similar to the RPSS_{L=2}, reflecting that the two scores are equivalent for a large ensemble size.

The sensitivity to small numbers of ensemble members is shown in Fig. 6. Here 30 sets of forecasts are averaged, with each set consisting of five members randomly resampled from the 40 ensemble members. Due to the smaller ensemble size, the confidence intervals are wider than for 40 ensemble members. The resulting mean of the subset of the RPSS_{L=2} (Fig. 6a) shows distinct regions of positive skill scores. However, strong negative skill scores cover most of the domain. Although these negative areas are not statistically significant, their magnitudes are comparable to the significant positive values. The RPSS_{L=1} (Fig. 6b) does not show the regional negative areas seen in the RPSS_{L=2}; the overall picture exhibits mainly significant positive areas in the Atlantic. The values of the RPSS_{L=1} based on 5 members are of the same order of magnitude as the values based on 40 members (Fig. 5b), but the confidence intervals are wider. The RPSS_D for this set of ensemble members is illustrated in Fig. 6c. The RPSS_D is generally higher than the RPSS_{L=2} (Fig. 6a). The European and North American continents, for which strong negative skill scores are found for the RPSS_{L=2}, are now mostly covered by skill scores in the range of ±0.10. This is an indication of the strong negative bias of the RPSS_{L=2} (see section 3). However, single localized areas with strong negative skill scores still remain.
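
The five-member subsampling used for Fig. 6 can be sketched generically; `skill_fn` is a placeholder for whichever score is being mapped, and all names here are ours, not the authors':

```python
import numpy as np

rng = np.random.default_rng(3)

def subset_mean_skill(fc40, obs, skill_fn, m=5, nsets=30):
    """Average `skill_fn` over `nsets` random m-member subsets drawn
    without replacement from the columns of a larger ensemble array."""
    scores = []
    for _ in range(nsets):
        cols = rng.choice(fc40.shape[1], size=m, replace=False)
        scores.append(skill_fn(fc40[:, cols], obs))
    return float(np.mean(scores))
```

With a hindcast array of shape (15, 40) and an RPSS implementation passed as `skill_fn`, this reproduces the averaging procedure described above.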

## 5. Signal-to-noise detection problem

The examined strategies to calculate the ranked probability skill score allow one to readdress the question of how many ensemble members are required for a forecast system to detect a climate anomaly of a known signal-to-noise ratio. In the white noise climate system the signal-to-noise ratio is described by the mean shift in standardized units of a hypothetical distribution of climate anomalies to the climatological distribution. The procedure corresponds to the one introduced by Kumar et al. (2001).
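
The shifted-mean setup can be simulated directly. In this perfect-model sketch (our simplification of the procedure, not the authors' code), forecasts and observations share a common mean shift `s`, and the reference forecast is the exact tercile climatology:

```python
import numpy as np

rng = np.random.default_rng(7)
EDGES = np.array([-0.4307, 0.4307])            # terciles of N(0, 1)

def rpss_for_shift(s, m, n=3000):
    """RPSS_{L=2} of m-member forecasts drawn from N(s, 1) and verified
    against observations drawn from the same shifted distribution."""
    Y_cl = np.array([1/3, 2/3, 1.0])           # cumulative climatology
    fc = cl = 0.0
    for _ in range(n):
        obs = s + rng.standard_normal()
        ens = s + rng.standard_normal(m)
        y = np.bincount(np.searchsorted(EDGES, ens), minlength=3) / m
        O = np.cumsum(np.bincount([np.searchsorted(EDGES, obs)], minlength=3))
        fc += np.sum((np.cumsum(y) - O) ** 2)
        cl += np.sum((Y_cl - O) ** 2)
    return 1.0 - fc / cl
```

With no shift and five members the score sits near the −0.2 bias, while a shift of one standard deviation yields clearly positive skill for any ensemble size.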

Figures 7a,c show the skill scores (RPSS_{L=2} and RPSS_D) as a function of the signal-to-noise ratio. Skill scores for ensemble systems with 2, 5, 40, and 100 members are shown. For large climate mean shifts, the skill scores are independent of the ensemble size. A clear difference is found for forecasts with small ensemble sizes and small climate shifts. As expected from the discussion above, the RPSS_{L=2} has a negative bias for small ensemble sizes. The RPSS_D, in contrast, shows no negative skill for weak anomalies and small ensemble sizes; it is positive irrespective of the ensemble size. The same holds for the RPSS_{L=1} (Fig. 7b).

An ensemble forecast system with a given ensemble size and a given number of forecasts still has a signal-to-noise detection limit. In our two strategies, the signal-to-noise detection limit can be described by the confidence intervals discussed above. In Figs. 7b,c the vertical dashed lines indicate the 95% confidence interval for the RPSS_{L=1} and the RPSS_D for ensemble sizes 5 and 40 and a set of 15 forecasts. For this hypothetical setup, comparable to the ECMWF hindcast system, a statistically significant skill score can only be obtained for a signal-to-noise ratio larger than ∼0.3 (40 members) and ∼0.6 (5 members), respectively. Conversely, this relation yields the minimum ensemble size required to achieve a significant skill score for a given signal-to-noise ratio (Kumar et al. 2001). For a higher number of forecasts, the confidence intervals are closer to zero, and hence a statistically significant skill score can be obtained for a smaller signal-to-noise ratio.

To further understand the behavior of the two skill scores, a decomposition of the RPSS_{L=2} and RPSS_D is undertaken. For two classes the RPS is the Brier score; thus the RPS can be decomposed into a reliability term, a resolution term, and an uncertainty term (for details see Hersbach 2000; Wilks 1995), such that

$$\mathrm{RPS} = \underbrace{\frac{1}{N} \sum_{k=1}^{K} N_k \left( Y_k - \bar{o}_k \right)^2}_{\mathrm{REL}} - \underbrace{\frac{1}{N} \sum_{k=1}^{K} N_k \left( \bar{o}_k - \bar{o} \right)^2}_{\mathrm{RES}} + \underbrace{\bar{o} \left( 1 - \bar{o} \right)}_{\mathrm{UNC}} , \tag{12}$$

where \bar{o} is the relative frequency of the observations, \bar{o}_k is the subsample relative frequency, and N_k is the number of times each of the K possible forecast probabilities Y_k is used in the collection of forecasts being verified. The reliability score (first term) is a function of the squared difference between the forecast probability and the observed frequency in the different probability categories, while the resolution score (second term) is the average squared difference between the observed frequency in each probability category and the mean frequency observed in the whole sample.
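
The two-class decomposition can be verified numerically. The helper below (our own sketch) groups forecasts by their discrete probability values and exploits the exact identity BS = REL − RES + UNC:

```python
import numpy as np

def brier_decomposition(y, o):
    """Murphy decomposition of the Brier score for discrete forecast
    probabilities y and binary outcomes o: returns (REL, RES, UNC)."""
    y = np.asarray(y, dtype=float)
    o = np.asarray(o, dtype=float)
    n, obar = len(y), o.mean()
    rel = res = 0.0
    for p in np.unique(y):                     # K distinct probabilities Y_k
        sub = o[y == p]                        # the N_k verifying outcomes
        rel += len(sub) * (p - sub.mean()) ** 2
        res += len(sub) * (sub.mean() - obar) ** 2
    return rel / n, res / n, obar * (1.0 - obar)
```

Checking that `rel - res + unc` reproduces the raw Brier score `np.mean((y - o)**2)` exactly is a useful sanity test when implementing the skill score split.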

In skill score form, the decomposed forecast score is compared with the score of the reference forecast, which for the debiased formulation contains an additional term A:

$$\mathrm{RPSS} = 1 - \frac{\mathrm{REL} - \mathrm{RES} + \mathrm{UNC}}{\mathrm{UNC} + A} . \tag{13}$$

Here A describes the discretization and squaring error, which is introduced in the reference forecast of (11). For the RPSS_{L=2}, A is equal to zero. Splitting up (13), the reliability skill score (RELSS), the resolution skill score (RESSS), and the uncertainty skill score (UNCSS) can be written as

$$\mathrm{RELSS} = 1 - \frac{\mathrm{REL}}{\mathrm{UNC} + A} , \qquad \mathrm{RESSS} = \frac{\mathrm{RES}}{\mathrm{UNC} + A} , \qquad \mathrm{UNCSS} = \frac{\mathrm{UNC}}{\mathrm{UNC} + A} , \tag{14}$$

so that RPSS = RELSS + RESSS − UNCSS. In Figs. 8a,d the RELSS_{L=2} and RELSS_D, respectively, are plotted as a function of the signal-to-noise ratio. The RELSS_{L=2} shows a strong dependence on the ensemble size. For a weak signal-to-noise ratio, the scores range from zero reliability to almost perfect reliability for the 2-member and 100-member ensemble systems, respectively. The RELSS_D still depends on the ensemble size for weak anomalies but reveals substantially higher values.

_{D}In Figs. 8b,e the results for the RESSS_{L}_{=2} and RESSS* _{D}* are shown. Whereas the RESSS

_{L=2}proves to be independent of the ensemble sizes, the RESSS

*is reduced for small ensemble sizes and strong anomalies. In Figs. 8c,f the UNCSS*

_{D}

_{L}_{=2}and UNCSS

*are shown for varying ensemble sizes. For*

_{D}*A*= 0, which is the case for the RPSS

_{L}_{=2}, the UNCSS

_{L}_{=2}is one for any ensemble size. The UNCSS

*, however, shows dependence on the ensemble size. For a five-member ensemble system the UNCSS*

_{D}*is reduced to a value of about 0.85. For the RPSS*

_{D}*the bias in the RELSS is compensated by introducing ensemble size dependence in the resolution and uncertainty term.*

_{D}## 6. Conclusions

In this study, the behavior of the RPSS_L is examined in the context of forecast systems with small ensemble sizes and different norms L. In agreement with earlier studies, it is shown that the standard calculation of the RPSS_{L=2} leads to a negative bias that can be even larger than the expected skill of the forecast system itself. This negative bias results from sampling errors in the forecast probabilities. It is a consequence of the squared measure used to quantify the forecast error in the cumulative probability space and is particularly large for small ensemble sizes. For higher orders of L, the bias is further increased.

Two strategies are introduced that address the bias problem of the RPSS_{L=2}. The first is a modified version of the RPS_{L=2} that is based on the absolute difference (RPS_{L=1}) instead of the squared difference of the cumulative probabilities. This score is comparable to the mean absolute error (MAE). Its expected skill score is independent of the ensemble size, whereas the confidence intervals are related to the ensemble size and the number of forecasts. However, the RPSS_{L=1} is not strictly proper, which allows forecasters to hedge their forecast probabilities toward probabilities that are likely to score higher. The second score considered is the RPSS_D, which represents a debiased version of the standard RPSS_{L=2}. The proposed modification involves the resampling of the climatology as a reference forecast. It is shown that this method is based on strictly proper scores and gives reasonable results even for systems with small ensemble sizes. A pure random climate forecast is used to show that the RPSS_{L=1} and RPSS_D provide an unbiased estimate of the skill score even for small ensemble sizes. For large ensemble sizes, both skill measures are comparable. The random noise forecasts are also the basis for determining the confidence intervals.

_{D}To test the newly proposed scores, two examples are considered. First the operational ECMWF Seasonal Forecast System 2 is used to quantify the skill of near-surface winter mean temperature forecasts with the RPSS_{L}_{=1}, RPSS_{L}_{=2}, and RPSS* _{D}*. It is shown that the new skill scores yield increased values for the forecast system, in particular for small ensemble sizes. Furthermore, the gridpoint-based skill score structure is much more homogeneous, and the occurrence of scattered negative values, as is the case of the RPSS

_{L}_{=2}, is largely suppressed.

Second, the RPSS_D and the RPSS_{L=1} are used to find the minimum ensemble size required to predict a given climate signal. Here a white noise climate system is used in which the signal-to-noise ratio is described by the mean shift, in standardized units, of a hypothetical distribution of climate anomalies relative to the climatological distribution. Using a hypothetical setup comparable to the ECMWF hindcast system, statistically significant skill scores can be anticipated for climate signal-to-noise ratios larger than ∼0.3 (40 members) and ∼0.6 (5 members), respectively. With this methodology, confidence intervals can be attached to the results of Kumar et al. (2001).

In the context of the signal-to-noise detection problem, a decomposition of the quadratic norms identifies the bias of the RPSS_{L=2} as a reliability problem, whereas the resolution is unaffected by the ensemble size. A decomposition of the resampling strategy of the RPSS_D shows an improvement in the reliability for small ensemble sizes. However, this comes at the expense of the resolution skill score, which is reduced. This seems plausible, as in this framework the resolution should also be affected by the ensemble size.

It is argued that for ensemble systems with small ensemble sizes, a debiased version of the RPSS_{L=2} should be used to quantify the probabilistic skill of the system, either the RPSS_{L=1} or the RPSS_D. Since the RPSS_{L=1} proves not to be strictly proper, we suggest the preferential use of the RPSS_D. However, other formulations could be considered that treat the forecasts as a random guess, as recently shown in a parallel study by Mason (2004).

Finally, the use of these new skill scores is not restricted to seasonal forecast systems but is likely to be beneficial for other applications such as climate prediction or short-range limited-area ensemble prediction systems (Marsigli et al. 2001). Debiased skill scores are also desirable for the comparison of multimodel ensemble systems with different ensemble sizes (Müller et al. 2004).

## Acknowledgments

This study was supported by the Swiss NSF through Grant 2100-061631.00 and the National Centre for Competence in Research Climate (NCCR-Climate). F. J. Doblas-Reyes received funding support through the DEMETER (EVK2-1999-00024) project. Thanks are expressed to Ch. Schär, I. T. Jolliffe, D. B. Stephenson, and the anonymous reviewers for their constructive comments and criticisms.

## REFERENCES

Anderson, D. L., and Coauthors, 2003: Comparison of the ECMWF seasonal forecast systems 1 and 2, including the relative performance for the 1997/8 El Niño. ECMWF Tech. Memo. 404, 93 pp.

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.*, **126**, 2503–2518.

Buizza, R., T. Petroliagis, T. N. Palmer, J. Barkmeijer, M. Hamrud, A. Hollingsworth, A. Simmons, and N. Wedi, 1998: Impact of model resolution and ensemble size on the performance of an ensemble prediction system. *Quart. J. Roy. Meteor. Soc.*, **124**, 1935–1960.

Déqué, M., 1997: Ensemble size for numerical seasonal forecasts. *Tellus*, **49A**, 74–86.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. *Wea. Forecasting*, **15**, 559–570.

Jolliffe, I. T., and D. B. Stephenson, 2003: *Forecast Verification: A Practitioner's Guide in Atmospheric Science.* Wiley, 240 pp.

Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. *J. Climate*, **14**, 1671–1676.

Marsigli, C., A. Montani, F. Nerozzi, T. Paccagnella, S. Tibaldi, F. Molteni, and R. Buizza, 2001: A strategy for high-resolution ensemble prediction. II: Limited-area experiments in four Alpine flood events. *Quart. J. Roy. Meteor. Soc.*, **127**, 2095–2115.

Mason, I., 1982: A model for assessment of weather forecasts. *Aust. Meteor. Mag.*, **30**, 291–303.

Mason, S. J., 2004: On using climatology as a reference strategy in the Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **132**, 1891–1895.

Müller, W. A., C. Appenzeller, and C. Schär, 2004: Probabilistic seasonal prediction of the winter North Atlantic Oscillation and its impact on near surface temperature. *Climate Dyn.*, doi:10.1007/s00382-004-0492-z.

Murphy, A. H., 1969: On the ranked probability score. *J. Appl. Meteor.*, **8**, 988–989.

Murphy, A. H., 1971: A note on the ranked probability score. *J. Appl. Meteor.*, **10**, 155–156.

Nicholls, N., 2001: The insignificance of significance testing. *Bull. Amer. Meteor. Soc.*, **81**, 981–986.

Palmer, T. N., C. Brankovic, and D. S. Richardson, 2000: A probability and decision-model analysis of PROVOST seasonal multi-model ensemble integrations. *Quart. J. Roy. Meteor. Soc.*, **126**, 2013–2033.

Swets, J. A., 1973: The relative operating characteristic in psychology. *Science*, **182**, 990–1000.

Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. *Bull. Amer. Meteor. Soc.*, **74**, 2317–2330.

Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. *Wea. Forecasting*, **8**, 379–398.

Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, *Ninth Conf. on Probability and Statistics in Atmospheric Science*, Virginia Beach, VA, Amer. Meteor. Soc., 206–213.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences.* Vol. 59, International Geophysics Series, Academic Press, 467 pp.

Fig. 2. As in Fig. 1 but for the upper and lower confidence intervals (95%) as a function of the number of equiprobable classes. Black (gray) lines illustrate the set of 40 (5) ensemble members.

Citation: Journal of Climate 18, 10; 10.1175/JCLI3361.1


Fig. 3. The RPSS_{L=2} for 5 (gray) and 40 (black) ensemble members as a function of the number of forecasts. The thin lines denote the upper and lower 95% confidence intervals for a 15-yr sample. Thick lines show the mean values.


Fig. 4. The RPSS_L as a function of the ensemble size and different norms L for white noise forecasts of two classes.


Fig. 5. Gridpoint-based (a) RPSS_{L=2}, (b) RPSS_{L=1}, and (c) RPSS_D for winter mean (DJF) near-surface temperature forecasts based on the ECMWF Seasonal Forecast System 2. The forecasts are based on November initialization and cover the winters 1987/88–2001/02. An ensemble system with 40 members is used. The ±95% confidence intervals are denoted as thick plain (dotted) contours.


Fig. 6. As in Fig. 5 but for an average of 30 simulations based on a random subset of five ensemble members.


Fig. 7. The (a) RPSS_{L=2}, (b) RPSS_{L=1}, and (c) RPSS_D as a function of a standardized anomaly (signal-to-noise ratio; see text for details). Numbers denote ensemble sizes 2, 5, 40, and 100. The scores are based on three equiprobable classes. The dashed lines in (b) and (c) indicate the one-tailed 95% confidence limits and the corresponding standardized anomaly for 40 and 5 members, respectively.


Fig. 8. The (a) RELSS_{L=2}, (b) RESSS_{L=2}, (c) UNCSS_{L=2}, (d) RELSS_D, (e) RESSS_D, and (f) UNCSS_D as a function of a standardized anomaly. Numbers denote ensemble sizes 2, 5, 40, and 100.
