## 1. Introduction

Probabilistic forecasts with ensemble prediction systems (EPSs) have found a wide range of applications in weather and climate risk management, and their importance continues to grow. For example, the European Centre for Medium-Range Weather Forecasts (ECMWF) now operationally runs a 51-member EPS for medium-range weather predictions (e.g., Buizza et al. 2005) and a 40-member system for seasonal climate forecasts (Anderson et al. 2003). The rationale behind the ensemble method is to approximate the expected probability density function (PDF) of a quantity by a finite set of forecast realizations. While the predicted probabilities account for the intrinsic uncertainties of atmospheric and oceanic evolution, it is not trivial to verify them in a way that considers their full information content. In particular, a predicted probability cannot be verified by a single observation (Candille and Talagrand 2005).

Several scores have been developed to quantify the performance of probabilistic prediction systems, for example, the rank histogram (Anderson 1996) and the relative operating characteristic (ROC; Mason 1982). The focus of this paper is on two other widely used scores, namely, the Brier score and the ranked probability score (BS and RPS, respectively). The RPS (Epstein 1969; Murphy 1969, 1971) is a squared measure that compares the cumulative distribution function (CDF) of a probabilistic forecast with the CDF of the corresponding observation over a given number of discrete probability categories. The BS (Brier 1950) is the special case of the RPS with two categories.

The RPS can be formulated as a skill score (RPSS) by relating the score of the forecasts to the score of a reference forecast. For short-range climate predictions, “climatology” is often used as a reference forecast, that is, a forecast strategy that is based on the climatological probabilities of the forecast categories. The RPSS is a favorable skill score in that it considers both the shape and overall tendency of the forecast PDF. It is sensitive to distance, that is, a forecast is increasingly penalized the more its CDF differs from the actual outcome (Wilks 1995). Finally, the RPSS is a strictly proper skill score, meaning that it cannot be improved by hedging the probabilistic forecasts toward values other than the forecaster’s true belief. A major flaw of the RPSS is its strong dependence on ensemble size (noticed, e.g., by Buizza and Palmer 1998), at least for ensembles with less than 40 members (Déque 1997). Indeed, the RPSS is negatively biased (Richardson 2001; Kumar et al. 2001; Mason 2004). Müller et al. (2005) investigated the reasons for this negative bias and suggested that it arises from sampling errors inherent to the discretization and the squared measure used in the RPSS definition. As a strategy to overcome this deficiency, Müller et al. (2005) proposed to artificially introduce the same sampling error into the reference score. Using this “compensation of errors” approach, they defined a new “debiased” ranked probability skill score (RPSS_D) that is independent of ensemble size while retaining the favorable properties, in particular the strict propriety, of the RPSS. However, a deficiency of their approach is that the bias correction is determined quasi empirically by frequent random resampling from climatology.

In this paper, we overcome this deficiency and introduce an improved analytical version of the RPSS_D that does not require a resampling procedure. A simple formula is presented that directly relates the RPSS and the RPSS_D, thus allowing for a straightforward analytical correction of the ensemble size dependent bias. The equation can be applied for any ensemble size and any choice of categories. It is derived and discussed in section 2. In section 3, the performance of the new RPSS_D formulation is demonstrated in a synthetic and a real case example. The last section provides concluding remarks.

## 2. The RPSS bias correction

### a. Definition of RPS and RPSS

Let *K* be the number of forecast categories to be considered. For a given probabilistic forecast–observation pair, the ranked probability score is defined as

$$\mathrm{RPS} = \sum_{k=1}^{K} (Y_k - O_k)^2, \tag{1}$$

where *Y_k* and *O_k* denote the *k*th components of the cumulative forecast and observation vectors **Y** and **O**, respectively. That is, $Y_k = \sum_{i=1}^{k} y_i$, with *y_i* being the probabilistic forecast for the event to happen in category *i*, and $O_k = \sum_{i=1}^{k} o_i$, with *o_i* = 1 if the observation is in category *i* and *o_i* = 0 if the observation falls into a category *j* ≠ *i*. Note that the RPS is zero for a perfect forecast and positive otherwise.

Let *p_i* be the climatological probability of the event falling into category *i*, and let *P_i* be the corresponding cumulative probability. If climatology is chosen as a reference strategy (as is done henceforth), then the RPS for the reference forecast, RPS_Cl, is

$$\mathrm{RPS_{Cl}} = \sum_{k=1}^{K} (P_k - O_k)^2. \tag{2}$$

The RPSS is then obtained by relating the average score of the forecasts, 〈RPS〉, to the average score of the reference, 〈RPS_Cl〉, such that a positive value of RPSS indicates forecast benefit with respect to the climatological forecast (Wilks 1995):

$$\mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS_{Cl}} \rangle}. \tag{3}$$

The Brier skill score, being the special case for *K* = 2 categories, is defined analogously:

$$\mathrm{BSS} = 1 - \frac{\langle \mathrm{BS} \rangle}{\langle \mathrm{BS_{Cl}} \rangle}. \tag{4}$$
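The score definitions above translate directly into code. The following Python sketch is our own illustration (the function and variable names are not from the paper); it computes the RPS of a single forecast–observation pair and the RPSS of a collection of pairs against a climatological reference:

```python
import numpy as np

def rps(y, o):
    """Ranked probability score of one forecast-observation pair:
    squared distance between the cumulative forecast vector and the
    cumulative observation vector (RPS definition above)."""
    Y = np.cumsum(y)                     # cumulative forecast Y_k
    O = np.cumsum(o)                     # cumulative observation O_k
    return float(np.sum((Y - O) ** 2))

def rpss(forecasts, observations, climatology):
    """Ranked probability skill score against a climatological
    reference [Eq. (3)]: positive values indicate skill."""
    mean_rps = np.mean([rps(y, o) for y, o in zip(forecasts, observations)])
    mean_rps_cl = np.mean([rps(climatology, o) for o in observations])
    return 1.0 - mean_rps / mean_rps_cl
```

For three equiprobable categories, a forecast identical to the observation vector scores RPS = 0, while the climatological forecast (1/3, 1/3, 1/3) against an observation in the first category scores 5/9.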

### b. The discrete ranked probability skill score RPSS_D

Following Müller et al. (2005), the negative bias of the RPSS can be compensated by introducing the same sampling error into the reference score [see Eq. (11) in Müller et al. 2005]. In other words, RPS_Cl in Eq. (3) needs to be replaced by the expectation E(RPS_ran), which the very ensemble prediction system under consideration would produce in the case of merely random climatological resamples. A debiased ranked probability skill score can then be formulated as

$$\mathrm{RPSS}_D = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS_{ran}} \rangle}. \tag{5}$$

In Müller et al. (2005), 〈RPS_ran〉 is determined empirically by frequent random resampling from climatology.

We now derive an analytical expression for the difference *D* between E(RPS_ran) and RPS_Cl, that is, for the term that apparently is responsible for the negative bias of the RPSS. We start by considering the process of randomly sampling one *M*-member ensemble from climatology, leading to one realization of RPS_ran. As above, consider *K* forecast categories with climatological probabilities *p_1*, *p_2*, . . . , *p_K*. Let *m_k* denote the number of ensemble members drawn from the *k*th category (such that $\sum_{k=1}^{K} m_k = M$), and let *y_k* = *m_k*/*M* be the corresponding probability forecast issued. The probability that the sampling process yields a forecast **y** = (*y_1*, *y_2*, . . . , *y_K*), that is, the probability to simultaneously draw *m_1*, *m_2*, . . . , *m_K* ensemble members from categories 1, 2, . . . , *K* with climatological probabilities *p_1*, *p_2*, . . . , *p_K*, is described by the multinomial distribution

$$P(m_1, m_2, \ldots, m_K) = \frac{M!}{m_1! \, m_2! \cdots m_K!} \; p_1^{m_1} p_2^{m_2} \cdots p_K^{m_K}. \tag{6}$$

The expectation and variance of the number of members drawn from the *k*th category, E[*m_k*] and Var[*m_k*], as well as the covariances Cov[*m_k*, *m_l*], can be calculated by summing over all possible sampling results {*m_1*, *m_2*, . . . , *m_K*}, yielding (Linder 1960):

$$\mathrm{E}[m_k] = M p_k, \tag{7}$$

$$\mathrm{Var}[m_k] = M p_k (1 - p_k), \tag{8}$$

$$\mathrm{Cov}[m_k, m_l] = -M p_k p_l, \tag{9}$$

with *k*, *l* ∈ {1, 2, . . . , *K*} and *k* ≠ *l*. Using the multinomial distribution and Eq. (2), the difference *D* between E(RPS_ran) and RPS_Cl can now be calculated for a given arbitrary observation with cumulative density **O**:

$$D = \mathrm{E}(\mathrm{RPS_{ran}}) - \mathrm{RPS_{Cl}} = \sum_{k=1}^{K} \mathrm{E}[(Y_k - O_k)^2] - \sum_{k=1}^{K} (P_k - O_k)^2. \tag{10}$$

Because *p_k* = E(*m_k*/*M*) = E(*y_k*), the expectation of the cumulative forecast is E(*Y_k*) = *P_k*, and each term of the first sum decomposes into a variance and a squared-bias part,

$$\mathrm{E}[(Y_k - O_k)^2] = \mathrm{Var}(Y_k) + (P_k - O_k)^2, \tag{11}$$

so that the observation-dependent terms cancel, leaving

$$D = \sum_{k=1}^{K} \mathrm{Var}(Y_k) = \frac{1}{M^2} \sum_{k=1}^{K} \left[ \sum_{i=1}^{k} \mathrm{Var}(m_i) + 2 \sum_{i=1}^{k} \sum_{j=i+1}^{k} \mathrm{Cov}(m_i, m_j) \right]. \tag{12}$$

Inserting Eqs. (8) and (9) yields

$$D = \frac{1}{M} \sum_{k=1}^{K} P_k (1 - P_k). \tag{13}$$

Note that *D* only depends on ensemble size *M* and on the climatological probabilities *p_1*, *p_2*, . . . , *p_K* of the *K* categories, but not on the observation **O**. Thus, the RPSS_D can be simply formulated as

$$\mathrm{RPSS}_D = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS_{Cl}} \rangle + D}, \tag{14}$$

with *D* given by Eq. (13). For very large ensemble size *M*, the correction term *D* converges toward zero and the RPSS_D toward the RPSS. On the other extreme, *D* is maximum for “ensembles” with one member only.

If all *K* categories are equiprobable, that is, if *p_k* = 1/*K* for all *k* ∈ {1, . . . , *K*}, then Eq. (13) can be further simplified and *D* now only depends on ensemble size *M* and on the number of categories *K*:

$$D = \frac{K^2 - 1}{6 M K}. \tag{15}$$

Adding further categories thus increases the correction *D*. In the limit of large *K*, the correction *D* becomes proportional to *K*/*M*.
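As a plausibility check, the analytical correction term of Eq. (13) can be compared with a brute-force Monte Carlo estimate of E(RPS_ran) − RPS_Cl, obtained by drawing many *M*-member ensembles from climatology. The following Python sketch is our own illustration, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def analytical_D(p, M):
    """Correction term of Eq. (13): D = (1/M) * sum_k P_k (1 - P_k),
    with P_k the cumulative climatological probabilities."""
    P = np.cumsum(p)
    return float(np.sum(P * (1.0 - P)) / M)

def monte_carlo_D(p, M, n=100_000):
    """Estimate E(RPS_ran) - RPS_Cl directly, by drawing n random
    M-member ensembles from climatology (observation fixed in category 1)."""
    P = np.cumsum(p)
    O = np.ones(len(p))                      # cumulative observation vector
    counts = rng.multinomial(M, p, size=n)   # multinomial member counts m_k
    Y = np.cumsum(counts / M, axis=1)        # cumulative forecasts Y_k
    mean_rps_ran = np.mean(np.sum((Y - O) ** 2, axis=1))
    rps_cl = np.sum((P - O) ** 2)
    return float(mean_rps_ran - rps_cl)
```

For three equiprobable categories and *M* = 5, both routes give *D* ≈ 0.089, in agreement with the equiprobable simplification (*K*² − 1)/(6*MK*) = 8/90.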

### c. The discrete Brier skill score BSS_D

Consider now the special case of *K* = 2 categories, that is, a binary event with climatological probabilities *p* and (1 − *p*), respectively. Then Eq. (13) becomes

$$D_{\mathrm{bin}} = \frac{p(1 - p)}{M}, \tag{16}$$

and the corresponding debiased Brier skill score reads

$$\mathrm{BSS}_D = 1 - \frac{\langle \mathrm{BS} \rangle}{\langle \mathrm{BS_{Cl}} \rangle + D_{\mathrm{bin}}}. \tag{17}$$

The correction term *D*_bin is maximum for *p* = 0.5 and minimum for *p* = 0 or *p* = 1. Thus, the impact of the skill score correction is negligible when rare (i.e., extreme) events are considered and is strongest when occurrence and nonoccurrence of the event are equiprobable.
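To get a feeling for the magnitude of the binary correction term *p*(1 − *p*)/*M*, consider a 10-member ensemble (a small worked example of our own):

```python
def d_bin(p, M):
    """Binary correction term: climatological event probability p,
    ensemble size M."""
    return p * (1.0 - p) / M

# Equiprobable event: the correction is largest.
print(d_bin(0.5, 10))   # 0.025
# Rare event (p = 0.05): the correction is almost negligible.
print(d_bin(0.05, 10))  # ≈ 0.00475
```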

To interpret *D*_bin, we decompose the average Brier score 〈BS〉 into its reliability (REL), resolution (RES), and uncertainty (UNC) components (for details of this decomposition, see Murphy 1973; Wilks 1995):

$$\langle \mathrm{BS} \rangle = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC} = \frac{1}{N} \sum_{j=1}^{J} N_j (y_j - \bar{o}_j)^2 - \frac{1}{N} \sum_{j=1}^{J} N_j (\bar{o}_j - \bar{o})^2 + \bar{o}(1 - \bar{o}), \tag{18}$$

where *N* is the total number of forecast–observation pairs considered, *J* is the number of distinct forecast probabilities issued, and *N_j* denotes the number of times each forecast probability *y_j* is used in the collection of forecasts being verified. Here, *ō* is the relative frequency of the observations (note that *ō* = *p* for sufficiently large *N*), and *ō_j* is the subsample relative frequency. The average Brier score for a climatological forecast, 〈BS_Cl〉, equals the uncertainty term, as both resolution and reliability are zero:

$$\langle \mathrm{BS_{Cl}} \rangle = \bar{o}(1 - \bar{o}) = \mathrm{UNC}. \tag{19}$$

For the average Brier score of randomly resampled climatological forecasts, 〈BS_ran〉, the reliability term does not vanish, since the probability forecasts issued (*y_j*) are generally different from *ō*. The resolution term is still zero because the observed frequency does not depend on the forecast probabilities. We get

$$\langle \mathrm{BS_{ran}} \rangle = \mathrm{REL_{ran}} + \mathrm{UNC}. \tag{20}$$

Comparing Eqs. (19) and (20), *D*_bin can be identified as

$$D_{\mathrm{bin}} = \langle \mathrm{BS_{ran}} \rangle - \langle \mathrm{BS_{Cl}} \rangle = \mathrm{REL_{ran}}. \tag{21}$$

That is, *D*_bin is the reliability term of the EPS under consideration for the case of randomly resampled forecasts. The correction term can therefore be interpreted as the EPS’s intrinsic reliability—or better yet, “intrinsic unreliability”—and is due to the finite number of ensemble members. This means that the classical BSS, which does not consider *D*_bin, compares intrinsically unreliable forecasts with a perfectly reliable reference forecast. However, given that reliability deficits can be removed a posteriori by calibration (Toth et al. 2003), a skill assessment based on such a comparison is unfair against the EPS. This conceptual deficiency of the BSS is overcome by adding *D*_bin to the climatological reference, ensuring that two equally (un)reliable forecast systems are compared. This interpretation holds analogously for the multicategorical RPSS, because the RPS can be formulated as a sum of Brier scores (Toth et al. 2003).
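The Murphy (1973) decomposition above is easy to compute from a verification sample by grouping the forecasts by their distinct probability values. A minimal Python sketch of our own (the helper name is hypothetical):

```python
import numpy as np

def brier_decomposition(y, o):
    """Murphy (1973) partition of the average Brier score into
    reliability (REL), resolution (RES), and uncertainty (UNC), so that
    <BS> = REL - RES + UNC.  y: forecast probabilities, o: 0/1 outcomes."""
    y = np.asarray(y, dtype=float)
    o = np.asarray(o, dtype=float)
    N = len(y)
    obar = o.mean()                      # overall relative frequency of the event
    rel = res = 0.0
    for yj in np.unique(y):              # loop over distinct forecast values y_j
        in_bin = (y == yj)
        Nj = in_bin.sum()                # how often y_j was issued
        obar_j = o[in_bin].mean()        # subsample relative frequency
        rel += Nj * (yj - obar_j) ** 2
        res += Nj * (obar_j - obar) ** 2
    unc = obar * (1.0 - obar)
    return rel / N, res / N, unc
```

By construction, REL − RES + UNC recombines to the plain average Brier score, which is a useful check on any implementation.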

## 3. Examples

In this section, the advantages of the new RPSS_D formulation for discrete probability forecasts are illustrated with two examples.

### a. RPSS_D applied to synthetic white noise forecasts

A Gaussian white noise climate is used to explore the performance of the derived analytical bias correction for different ensemble sizes. Similar to the procedure of Müller et al. (2005), 15 forecast ensembles together with 15 observations are randomly chosen from this synthetic “climate” and are classified into three equiprobable categories. From these, the 〈RPS〉 and 〈RPS_Cl〉 values are calculated. Using Eqs. (3) and (14), the RPSS and RPSS_D (in its new formulation) are determined. This procedure is repeated 10 000 times for ensemble sizes ranging from 1 to 50 members, yielding 10 000 RPSS and RPSS_D values each. From these, the means are obtained as plotted in Fig. 1 (solid lines) as a function of ensemble size. While the RPSS reveals a negative bias that becomes stronger with decreasing ensemble size (ranging from −0.02 for a 50-member ensemble to −0.50 for a 2-member ensemble), the RPSS_D remains constantly near zero for all ensemble sizes; that is, the negative bias is removed. These results are equivalent to those shown by Müller et al. (2005, their Fig. 1), who determined the RPSS_D with Eq. (5), that is, by frequent resampling.

_{D}In an alternative approach, we fit Gaussian distributions to the randomly chosen ensembles from above and calculate the RPSS values on the basis of these distributions, rather than using the raw ensemble output. This method is considered to yield more accurate estimates of the forecast quantiles, provided that the parent distribution is Gaussian (Wilks 2002). The negative bias of the RPSS can indeed be slightly reduced, though not removed, by such a smoothing procedure (dashed line in Fig. 1). This is consistent with the notion of increasing the effective ensemble size by such a fit (Wilks 2002). In the terminology of section 2c, the intrinsic unreliability of an EPS can be reduced, but not eliminated, by smoothing the forecast ensembles with an appropriate distribution.

Müller et al. (2005) pointed out that despite the bias elimination provided by the RPSS_D, there remains a signal-to-noise detection limit; in other words, there is always a chance that a positive skill score is due to coincidence and that the predictions are, in fact, not any better than the reference. This is the null hypothesis one wants to reject. Therefore, it is necessary to know the confidence intervals of the RPSS_D distribution that the very EPS under consideration would produce in the case of merely random climatological forecasts. An estimate of this distribution can be obtained by the procedure described above, that is, by explicitly sampling a large number of RPSS_D values from a white noise climate for a given number of ensemble members and forecast–observation pairs. In Fig. 2, the 95% quantiles of such RPSS_D distributions are displayed. Obviously, the confidence threshold decreases both with growing ensemble size and with the number of forecast–observation pairs. Thus, while the expectation of the RPSS_D is independent of ensemble size, large ensembles are nevertheless favorable, as the confidence threshold moves closer toward zero and smaller skill score values become statistically significant. For example, if an EPS with five ensemble members is given and five forecast–observation pairs are available, then one cannot conclude with statistical significance that the EPS outperforms the reference unless the RPSS_D is larger than 0.42. This confidence threshold can be halved, for instance, by increasing the ensemble size to 27.
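The detection limit can be estimated with a small Monte Carlo experiment of the kind described above: sample many RPSS_D values under the null hypothesis that the forecasts are purely random draws from climatology, and take the 95% quantile as the significance threshold. The following Python sketch is our own illustration; exact threshold values vary with sampling noise:

```python
import numpy as np

rng = np.random.default_rng(42)

def rpss_d_null_quantile(M, n_pairs, K=3, n_trials=2000, q=0.95):
    """Upper q-quantile of the RPSS_D null distribution: forecasts and
    observations are both drawn from a white-noise climate with K
    equiprobable categories, so any positive score is pure chance."""
    p = np.full(K, 1.0 / K)
    P = np.cumsum(p)
    D = (K * K - 1) / (6.0 * M * K)              # analytical correction term
    scores = np.empty(n_trials)
    for t in range(n_trials):
        counts = rng.multinomial(M, p, size=n_pairs)     # random ensembles
        Y = np.cumsum(counts / M, axis=1)                # cumulative forecasts
        cat = rng.integers(0, K, size=n_pairs)           # random observed categories
        O = (np.arange(K)[None, :] >= cat[:, None]).astype(float)
        mean_rps = np.mean(np.sum((Y - O) ** 2, axis=1))
        mean_rps_cl = np.mean(np.sum((P[None, :] - O) ** 2, axis=1))
        scores[t] = 1.0 - mean_rps / (mean_rps_cl + D)
    return float(np.quantile(scores, q))
```

With *M* = 5 members and five forecast–observation pairs, the estimated threshold comes out in the vicinity of the 0.42 quoted above and shrinks as more pairs are added.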

### b. BSS_D applied to real forecast data

In a second example, the correction formula for the RPSS [Eq. (13)] is applied to the problem of quantifying the gain in prediction skill due to increasing ensemble size—a topic that is also relevant for comparing single versus multimodel approaches in seasonal forecasting (e.g., Palmer et al. 2004). Here, the widely used Brier skill score is employed with two equiprobable categories. Forecasts of mean near-surface (2 m) temperature for March with 4-month lead time are used as hindcast data. They are obtained from the ECMWF System 2 seasonal prediction model with 40 ensemble members (Anderson et al. 2003) and are verified gridpoint-wise against the corresponding “observations” from the 40-yr ECMWF Re-Analysis (ERA-40) dataset (Uppala et al. 2005). The threshold separating the two probability categories is determined from the 15 yr of hindcast and observation data, respectively. Three regions are considered: (i) the Niño-3.4 region (5°S–5°N, 120°–170°W) with enhanced seasonal predictability (e.g., Latif et al. 1998), (ii) a region over southern Africa (0°–40°S, 10°–50°E) with “intermediate” seasonal predictability (Klopper et al. 1998), and (iii) a region over central Europe (40°–55°N, 10°–30°E) where seasonal predictability is considered low (e.g., Müller et al. 2004). For these three regions, both BSS [Eq. (4)] and BSS_D [Eq. (17)] are determined by spatial and temporal averaging over all BS and BS_Cl scores obtained at each grid point for each hindcast year. This procedure is then repeated for smaller ensemble sizes (down to 2 members), which are constructed by randomly resampling from the 40 ensemble members available in the ECMWF forecasts. The results are shown in Fig. 3, where BSS and BSS_D are plotted against ensemble size for each of the regions. Consistent with what has been shown for the white noise climate above, in all three cases the value of the classical BSS increases with ensemble size. This is also consistent with earlier studies (see, e.g., Palmer et al. 2004, their Fig. 5). On the other hand, with the new formulation of the Brier skill score, the negative bias disappears for all three geographical regions considered. Figure 3 suggests that the value of the BSS_D is independent of ensemble size for any level of prediction skill, although, strictly speaking, this has only been proven in section 2 for uncorrelated ensemble members. Note that in this example the calculation of the 95% confidence intervals is not trivial, because spatial averages over statistically dependent grid points are involved and thus the number of independent forecast–observation pairs is not known. This problem could be overcome, for example, by estimating the effective number of spatial degrees of freedom, as suggested by Bretherton et al. (1999).

## 4. Conclusions

This study has addressed the negative bias of the widely used Brier and ranked probability skill scores (BSS and RPSS, respectively). A simple analytical formula has been derived that quantifies the bias as a function of the category probabilities and the ensemble size. Using this expression, an easy-to-implement formulation for a debiased discrete ranked probability skill score (RPSS_D) has been obtained. Formally, the RPSS_D is calculated by adding a correction term *D* to the “classical” reference score of the RPSS. This correction term is inversely proportional to the number of ensemble members and can therefore be neglected for systems with large ensemble size.

The performance of the new RPSS_D formulation has been illustrated in two examples. They show that the expectation of the RPSS_D for a given forecast strategy is indeed independent of ensemble size, whereas its variance is not. This means that increasing the number of ensemble members does not increase prediction skill per se; rather, the statistical significance of the skill score is enhanced.

The actual reason for the negative bias of the RPSS is the fact that ensemble prediction systems (EPSs) are inherently unreliable, that is, have a positive reliability term. Their “intrinsic unreliability” grows as ensemble size decreases. Given that reliability deficits can be corrected by calibration a posteriori, it may be misleading to base a skill score on the comparison of two prediction systems with different ensemble sizes and thus different intrinsic unreliabilities. This is exactly what happens in the case of the conventional RPSS, because its reference score implies—in the RPSS_D view—an infinitely large ensemble size [such that *D* is zero in Eq. (14)]. In other words, the RPSS could be regarded as ill conditioned because it compares the score of an intrinsically unreliable forecast system of finite ensemble size with the reference score of a forecast system of infinite ensemble size and perfect reliability. This problem is resolved and the skill score debiased by adding the EPS’s intrinsic unreliability, *D*, to the climatological reference.

In summary, the RPSS_D provides a powerful and easily implemented tool for the evaluation of probabilistic forecasts from small ensembles and for the comparison of EPSs with different ensemble sizes. The bias correction is particularly important when multimodel ensembles are to be evaluated, where the benefit of increasing ensemble size needs to be clearly distinguished from the benefit of multimodel combination.

## Acknowledgments

Thanks are expressed to Wolfgang Müller for his helpful comments on this manuscript. This study was supported by the Swiss National Science Foundation through the National Centre for Competence in Research (NCCR Climate) and by ENSEMBLES EC Contract GOCE-CT-2003-505539.

## REFERENCES

Anderson, D. L. T., and Coauthors, 2003: Comparison of the ECMWF seasonal forecast systems 1 and 2, including the relative performance for the 1997/8 El Niño. ECMWF Tech. Memo. 404, 93 pp.

Anderson, J. S., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integration. *J. Climate*, **9**, 1518–1530.

Bretherton, C. S., M. Widmann, V. P. Dymnikov, J. M. Wallace, and I. Bladé, 1999: The effective number of spatial degrees of freedom of a time-varying field. *J. Climate*, **12**, 1990–2009.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. *Mon. Wea. Rev.*, **78**, 1–3.

Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. *Mon. Wea. Rev.*, **126**, 2508–2518.

Buizza, R., P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. *Mon. Wea. Rev.*, **133**, 1076–1097.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. *Quart. J. Roy. Meteor. Soc.*, **131**, 2131–2150.

Déque, M., 1997: Ensemble size for numerical weather forecasts. *Tellus*, **49A**, 74–86.

Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. *J. Appl. Meteor.*, **8**, 985–987.

Klopper, E., W. A. Landmann, and J. van Heerden, 1998: The predictability of seasonal maximum temperature in South Africa. *Int. J. Climatol.*, **18**, 741–758.

Kumar, A., A. G. Barnston, and M. R. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. *J. Climate*, **14**, 1671–1676.

Latif, M., D. Anderson, M. Cane, R. Kleeman, A. Leetmaa, J. O’Brien, A. Rosati, and E. Schneider, 1998: A review of the predictability and prediction of ENSO. *J. Geophys. Res.*, **103**, 14375–14393.

Linder, A., 1960: *Statistische Methoden*. 3d ed. Birkhäuser Verlag, 484 pp.

Mason, I. B., 1982: A model for assessment of weather forecasts. *Aust. Meteor. Mag.*, **30**, 291–303.

Mason, S. J., 2004: On using “climatology” as a reference strategy in the Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **132**, 1891–1895.

Müller, W. A., C. Appenzeller, and C. Schär, 2004: Probabilistic seasonal prediction of the winter North Atlantic Oscillation and its impact on near surface temperature. *Climate Dyn.*, **24**, 213–226.

Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A de-biased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. *J. Climate*, **18**, 1513–1523.

Murphy, A. H., 1969: On the ranked probability skill score. *J. Appl. Meteor.*, **8**, 988–989.

Murphy, A. H., 1971: A note on the ranked probability skill score. *J. Appl. Meteor.*, **10**, 155–156.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Palmer, T. N., and Coauthors, 2004: Development of a European multimodel ensemble system for seasonal-to-interannual prediction (DEMETER). *Bull. Amer. Meteor. Soc.*, **85**, 853–872.

Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. *Quart. J. Roy. Meteor. Soc.*, **127**, 2473–2489.

Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. *Forecast Verification: A Practitioner’s Guide in Atmospheric Science*, I. T. Joliffe and D. B. Stephenson, Eds., John Wiley & Sons, 137–163.

Uppala, S. M., and Coauthors, 2005: The ERA-40 re-analysis. *Quart. J. Roy. Meteor. Soc.*, **131**, 2961–3012.

Wilks, D. S., 1995: *Statistical Methods in the Atmospheric Sciences*. International Geophysics Series, Vol. 59, Academic Press, 467 pp.

Wilks, D. S., 2002: Smoothing forecast ensembles with fitted probability distributions. *Quart. J. Roy. Meteor. Soc.*, **128**, 2821–2836.

Fig. 2. Upper 95% confidence levels (contour lines) for the debiased ranked probability skill score (RPSS_D) for random white noise forecasts as a function of ensemble size and number of forecast–observation pairs. Three equiprobable forecast categories are used. The confidence levels are estimated from 10 000 RPSS_D values.

Citation: Monthly Weather Review 135, 1; 10.1175/MWR3280.1


Fig. 3. Brier skill score (BSS) and debiased Brier skill score (BSS_D) as a function of ensemble size for near-surface temperature predictions for March with a lead time of 4 months. Data are averaged over 15 yr (1988–2002) over (a) the Niño-3.4 region, (b) southern Africa, and (c) central Europe. Two equiprobable classes are used, that is, the events are “temperature below normal” and “temperature above normal.” The forecasts are obtained from the ECMWF Seasonal Forecast System 2 and the verifying observations are from ERA-40 data.

Citation: Monthly Weather Review 135, 1; 10.1175/MWR3280.1
