A Debiased Ranked Probability Skill Score to Evaluate Probabilistic Ensemble Forecasts with Small Ensemble Sizes

W. A. Müller, Swiss Federal Office of Meteorology and Climatology (MeteoSwiss), Zürich, Switzerland

C. Appenzeller, Swiss Federal Office of Meteorology and Climatology (MeteoSwiss), Zürich, Switzerland

F. J. Doblas-Reyes, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom

M. A. Liniger, Swiss Federal Office of Meteorology and Climatology (MeteoSwiss), Zürich, Switzerland

Abstract

The ranked probability skill score (RPSS) is a widely used measure to quantify the skill of ensemble forecasts. The underlying score is defined by the quadratic norm and is comparable to the mean squared error (mse) but it is applied in probability space. It is sensitive to the shape and the shift of the predicted probability distributions. However, the RPSS shows a negative bias for ensemble systems with small ensemble size, as recently shown. Here, two strategies are explored to tackle this flaw of the RPSS. First, the RPSS is examined for different norms L (RPSSL). It is shown that the RPSSL=1 based on the absolute rather than the squared difference between forecasted and observed cumulative probability distribution is unbiased; RPSSL defined with higher-order norms show a negative bias. However, the RPSSL=1 is not strictly proper in a statistical sense. A second approach is then investigated, which is based on the quadratic norm but with sampling errors in climatological probabilities considered in the reference forecasts. This technique is based on strictly proper scores and results in an unbiased skill score, which is denoted as the debiased ranked probability skill score (RPSSD) hereafter. Both newly defined skill scores are independent of the ensemble size, whereas the associated confidence intervals are a function of the ensemble size and the number of forecasts.

The RPSSL=1 and the RPSSD are then applied to the winter mean [December–January–February (DJF)] near-surface temperature predictions of the ECMWF Seasonal Forecast System 2. The overall structures of the RPSSL=1 and the RPSSD are more consistent and largely independent of the ensemble size, unlike the RPSSL=2. Furthermore, the minimum ensemble size required to predict a climate anomaly given a known signal-to-noise ratio is determined by employing the new skill scores. For a hypothetical setup comparable to the ECMWF hindcast system (40 members and 15 hindcast years), statistically significant skill scores were only found for a signal-to-noise ratio larger than ∼0.3.

Corresponding author address: Dr. W. A. Müller, Kirchgasse 49, D-79291 Merdingen, Germany. Email: climate@gmx.de


1. Introduction

In recent years probabilistic ensemble forecast systems have been established in a wide range of applications. The probabilistic nature of these forecasts requires verification techniques based on probabilistic skill measures. However, there is no general agreement on the best skill score; the choice depends on the particular application considered or the forecast system being used. Examples are the Brier score (BS) or the relative operating characteristic (for details see Swets 1973; Mason 1982; Jolliffe and Stephenson 2003; Wilks 1995). The Brier score, for instance, is essentially the mean squared error (mse) of the probability forecast of a dichotomous event, for example, a probability forecast of the winter mean temperature being above or below the climatological mean (Palmer et al. 2000). For a range of applications, such a dichotomous score gives an incomplete picture, since the entire shape of the probability function is not considered. A multicategory score that measures the shape as well as the central tendency of the whole probability density function (PDF) is more suitable. An often-used score for such applications is the ranked probability score (RPS; Epstein 1969; Murphy 1969, 1971).

The RPS is based on the cumulative density function (CDF) and is classically defined by the quadratic norm, hereafter denoted as RPSL=2. The score is the integrated squared difference between the forecasted and the observed CDF. It can be seen as the probabilistic extension of the mse. However, the RPSL=2 is applied in the cumulative probability space and not in the physical space, that is, the integration is taken over categories. It can be interpreted as an extension of the Brier score for finite ordered categories. The extension to an infinite number of classes results in the continuous RPS (Unger 1985; Hersbach 2000).

Current ensemble prediction systems (EPSs) for medium-range forecasts (3 to 10 days) typically use ensemble sizes varying from 17 [National Centers for Environmental Prediction (NCEP)] to 50 [European Centre for Medium-Range Weather Forecasts (ECMWF)] members to construct the probability density function (Toth and Kalnay 1993; Tracton and Kalnay 1993; Buizza et al. 1998). For long-range forecasts, with prediction times of months to years, the ensemble size is usually smaller. The reason for having small ensemble sizes lies primarily in computational costs, which is particularly true for hindcast experiments that are used for verification and calibration. For example, the 15-yr hindcast database of the ECMWF Seasonal Forecast System 2 mostly consists of five ensemble members.

Such a small ensemble size (and number of forecasts) can lead to statistical problems stemming from large sampling errors. In the context of numerical weather prediction, Buizza and Palmer (1998) have shown that the forecast skill can be improved by increasing the ensemble size, but the extent of the improvement depends on the measure used. For the RPSSL=2, they found major skill improvement for at least up to eight members. In a perfect-model seasonal forecast approach, Dequé (1997) determined the ensemble size required for skill score saturation for various parameters; an ensemble size of 40 was suggested for European temperature forecasts. Kumar et al. (2001) have also explored the influence of the ensemble size on the RPSSL=2 and noted that the RPSSL=2 is strongly negatively biased for small ensemble sizes. Based on this bias, the minimum ensemble size required to predict a climate signal given a known signal-to-noise ratio (the mean shift of the anomaly distribution in standardized units) was derived.

Here we show that the substantial negative bias of the RPSSL=2 for small ensemble sizes is primarily a consequence of the discretization and squaring measure in its formulation. Two strategies are introduced that overcome these deficiencies (section 2). The characteristics of the suggested modifications of the RPSSL=2 are examined with a synthetic example (section 3). The new techniques are then applied to a real seasonal winter temperature forecast based on the ECMWF Seasonal Forecast System 2 for the years 1987–2001 (section 4) and to the climate signal-to-noise detection problem (section 5). Finally, a conclusion and discussion are given in the last section.

2. Definition of RPSSL

Similar to the mean squared error (mse), the RPSL=2 is a quadratic measure, and thus larger deviations from the actual probability are penalized much more strongly than smaller ones. The RPSL for any norm L is defined as
RPS_L = (1/N) Σ_{k=1}^{N} RPS_{L,k},
where N is the number of forecasts, k is the forecast index, and
RPS_{L,k} = Σ_{j=1}^{J} |Y_j − O_j|^L.   (1)
Here J denotes the number of classes and L is the norm. The cumulative probabilities of the forecasts, Y_j, and of the observations, O_j, are defined as Y_j = Σ_{i=1}^{j} y_i and O_j = Σ_{i=1}^{j} o_i, where y_i and o_i are the forecast and observed probabilities, respectively, for class i. The RPS_L is zero in the case of a perfect forecast and positive otherwise. The calculation of the skill score is based on the comparison of the forecast score (RPS_{L,FC}) with a reference forecast score (RPS_{L,CL}).
Thus the RPSSL becomes
RPSS_L = 1 − RPS_{L,FC} / RPS_{L,CL}.   (2)
Any positive value of the RPSSL indicates a forecast benefit compared to the reference forecast. For L = 2 the skill score becomes the standard squared definition (Wilks 1995), whereas L = 1 gives the absolute skill score RPSSL=1. In this study, the distribution of the reference forecast is defined by a Gaussian fit to the observations. The equiprobable classes are obtained from the distribution of the reference forecasts.
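As a concrete illustration, the computation of Eqs. (1) and (2) can be sketched in code. The following is a minimal sketch, not code from the original study; the function name, the tercile edges, and the example values are our own choices:

```python
import numpy as np

def rps_l(ens, obs, edges, L=2):
    """RPS_L of Eq. (1) for a single forecast.

    ens   : 1-D array of ensemble member values
    obs   : the verifying observation (scalar)
    edges : interior class boundaries defining the J categories
    L     : norm (L=2 is the classical RPS, L=1 the absolute version)
    """
    bins = np.concatenate(([-np.inf], edges, [np.inf]))
    # forecast probabilities y_i as ensemble relative frequencies
    y = np.histogram(ens, bins=bins)[0] / len(ens)
    # observed probabilities o_i: 1 in the observed class, 0 elsewhere
    o = np.histogram([obs], bins=bins)[0].astype(float)
    # cumulative probabilities Y_j and O_j, summed over the classes
    return np.sum(np.abs(np.cumsum(y) - np.cumsum(o)) ** L)

# a perfect tercile forecast scores zero; a fully wrong two-class forecast scores 1
print(rps_l(np.array([1.0, 1.2, 0.9]), 1.0, [-0.4307, 0.4307]))  # 0.0
print(rps_l(np.array([-1.0, -1.2, -0.9]), 1.0, [0.0]))           # 1.0
```

The skill score of Eq. (2) then follows as one minus the ratio of the mean forecast score to the mean reference score over all N forecasts.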

3. A synthetic example

In the following, a white noise climate system is employed to explore the sensitivity of the RPSSL=2 to the ensemble size. Forecasts and observations are chosen to be random Gaussian time series, so the expected outcome is a skill score of zero, that is, no benefit compared to the reference forecast. The white noise climate consists of a sample of 100 000 cases. For each skill score, 300 ensembles are randomly chosen and verified against randomly chosen observations, which gives a robust estimate of the skill score. The procedure is repeated 100 times, yielding 100 skill scores from which the mean and the 95% confidence intervals are calculated. These confidence intervals serve as guidance for the subsequent local testing of the real seasonal forecasts against random time series. Other significance tests (e.g., field significance) are available; a more complete discussion is found in Wilks (1995) and Nicholls (2001). In light of a potential application to the ECMWF forecast system, the skill scores are calculated with up to 40 ensemble members and 15 forecasts (one per hindcast year). Three equiprobable classes (above, normal, and below) are used.
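A compact Monte Carlo sketch of this white-noise experiment (our own code, with a reduced sample size; the tercile edges ±0.4307 are those of a standard Gaussian) reproduces the qualitative result below: the RPSS of a no-skill five-member system is distinctly negative for L = 2 but close to zero for L = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [-0.4307, 0.4307]         # tercile edges of a standard Gaussian
CLIM = np.array([1/3, 2/3, 1.0])  # cumulative climatological reference
m, n_fc = 5, 300                  # ensemble size, forecasts per skill score

def cum_probs(vals):
    # cumulative class probabilities estimated from a sample of values
    counts = np.histogram(vals, bins=[-np.inf] + edges + [np.inf])[0]
    return np.cumsum(counts / len(vals))

def rpss(L):
    rps_fc = rps_cl = 0.0
    for _ in range(n_fc):
        ens = rng.standard_normal(m)   # white-noise "forecast"
        obs = rng.standard_normal(1)   # independent "observation"
        O = cum_probs(obs)
        rps_fc += np.sum(np.abs(cum_probs(ens) - O) ** L)
        rps_cl += np.sum(np.abs(CLIM - O) ** L)
    return 1.0 - rps_fc / rps_cl

s2, s1 = rpss(2), rpss(1)
print(s2, s1)   # s2 is distinctly negative, s1 close to zero
```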

Figure 1a shows the dependence of the RPSSL=2 on the ensemble size for white noise climate forecasts. The mean of the RPSSL=2 exhibits negative skills ranging from –0.20 up to −0.02 for an ensemble system of size 5 and 40, respectively. These values are below the expected value of zero benefit. With larger ensemble size the bias decreases slowly toward zero. The 95% confidence intervals (thin lines) are asymmetric and vary from about −0.45/0.05 for a 5-member system to about −0.10/0.05 for a size of 40. The confidence intervals as a function of the number of equiprobable classes are shown in Fig. 2a. The confidence intervals of a 40-member forecast system (black) are much closer to the mean values than for a 5-member system (gray). A strong asymmetry around zero is visible for a small ensemble size. The magnitudes of the confidence intervals are largest for two classes. Finally, the bias and confidence intervals are evaluated as a function of the number of forecasts (Fig. 3). The confidence intervals are closer to the mean for a higher number of forecasts. However, the bias remains unchanged, indicating that it is related to the ensemble size.

The corresponding results for the RPSSL=1 are shown in Figs. 1b and 2b. The mean of the RPSSL=1 (Fig. 1b) is located close to zero even for the smallest ensemble size, for example, for a five-member system. The 95% confidence intervals are spread symmetrically around the zero mean with a magnitude of about ±0.14. For a large ensemble size (40) the spread is about ±0.05. The confidence intervals are now symmetric around zero for all classes (Fig. 2b) and only depend on the ensemble size, with reduced spread for larger samples. Obviously, the RPSSL=1 has no bias for small ensemble sizes.

To clarify the origin of these differences, a step by step calculation is performed for the RPSL=2 and RPSL=1 with white noise climate forecast. Suppose a forecast for two equiprobable classes (in this case, the RPSL reduces to the BS). Since the cumulative probabilities for the forecast and observation in the second class are always both equal to one, the RPSL=2,k value reduces to
RPS_{L=2,k} = (Y_1 − O_1)^2.   (3)
Similarly, the RPSL=1 reduces to
RPS_{L=1,k} = |Y_1 − O_1|.   (4)
A climatological reference forecast predicts a probability of 1/2 for the event to be below the median. For such a forecast the difference to the observations in the cumulative probability space is always 1/2, independent of whether the observations were above or below the median. Hence the RPSL=2,CL is 1/4, whereas the RPSL=1,CL is 1/2.
In practice, an ensemble system has only a finite ensemble size. Thus, the predicted cumulative probabilities Y_1 can only take on a set of discrete values, namely (0, 1/m, 2/m, . . . , (m − 1)/m, 1) for an ensemble system with m members. For a three-member ensemble system these values are explicitly (0, 1/3, 2/3, 1). The mean RPSL=2,FC is then (1/8)·1 + (3/8)·(2/3)^2 + (3/8)·(1/3)^2 + (1/8)·0 = 1/3, where the forecast probabilities are weighted by the relative frequency of their occurrence. Evidently, the squaring of the cumulative measures gives an RPSL=2,FC that is larger than the expected reference value RPSL=2,CL of 1/4. As a consequence, the RPSSL=2 takes a negative value. In a general formulation of the bias of a white noise climate forecast, each possible cumulative probability value is weighted by its probability of occurrence, which is given by the binomial distribution. Thus, for a two-class system the score of the unskilled ensemble forecast is given by
RPS_{L,FC} = Σ_{i=0}^{m} C(m, i) (1/2)^m (i/m)^L,   (5)
whereas for such a system the score for the reference forecast is
RPS_{L,CL} = (1/2)^L.   (6)

For L = 2, the RPSL=2,FC is always larger than RPSL=2,CL. Figure 4 shows the bias of the RPSSL for different ensemble sizes and norms. The analytical bias is almost equal to the bias in the white noise climate forecast (Fig. 1a). Finally, note that for more than two classes the relative frequency of occurrence does not increase linearly, and Eqs. (5) and (6) increase in complexity. But the bias also remains for forecast systems whose ensemble size is an integer multiple of the number of categories (not shown).
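Equations (5) and (6) can be evaluated directly. A small sketch (our own code, not from the original study) reproduces the numbers above: the two-class RPSS_{L=2} of an unskilled three-member system is 1 − (1/3)/(1/4) = −1/3, while the RPSS_{L=1} is exactly zero for any ensemble size:

```python
from math import comb

def rps_fc(m, L):
    # Eq. (5): expected score of an unskilled m-member forecast for two
    # classes; each discrete probability i/m carries a binomial weight
    return sum(comb(m, i) * 0.5**m * (i / m) ** L for i in range(m + 1))

def rps_cl(L):
    # Eq. (6): the climatological reference always forecasts 1/2
    return 0.5**L

for m in (3, 5, 40):
    bias_l2 = 1 - rps_fc(m, 2) / rps_cl(2)
    bias_l1 = 1 - rps_fc(m, 1) / rps_cl(1)
    print(m, round(bias_l2, 4), round(bias_l1, 4))
```

In this two-class case the L = 2 bias works out to exactly −1/m (e.g., −0.2 for five members), decaying only slowly toward zero, while the L = 1 bias vanishes identically.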

For L = 1 the RPSL=1,FC of a three-member ensemble system gives (1/8)·1 + (3/8)·(2/3) + (3/8)·(1/3) + (1/8)·0 = 1/2. The outcome of RPSL=1,FC is exactly equal to the expected reference value RPSL=1,CL of 1/2, and the RPSSL=1 is equal to zero. The analytical bias (Fig. 4) is zero, as expected. For higher-order norms the bias is even stronger. The example illustrates that the negative bias in the standard definition of the RPSL=2 results from sampling errors in the forecast probabilities. It is primarily a consequence of the squared measure used to quantify the forecast error in the cumulative probability space. This discretization-squaring error also occurs for large ensemble sizes, but there the negative bias is relatively small. However, for systems with small ensemble sizes, the bias can reach values comparable to the skill of the system, and the score becomes meaningless.

Although the RPSSL=1 is a skill score that can be used for systems with small ensemble size, it is handicapped since it is not strictly proper; that is, the forecast probabilities can be hedged toward values that are likely to give a better or equal score. Strictly proper scores discourage forecast systems from hedging their forecast probabilities toward probabilities that are likely to score higher (Jolliffe and Stephenson 2003). To show that the RPSL is not strictly proper, we calculate the expected score of a forecast y_i when the true probability of class i is f_i, and determine which forecast yields the best expected score. Let F_j be the cumulative probability of the event, defined as F_j = Σ_{i=1}^{j} f_i. The expected RPSL becomes (e.g., see Wilks 1995)
E(RPS_L) = Σ_{j=1}^{J} [ F_j (1 − Y_j)^L + (1 − F_j) Y_j^L ].   (7)
If the score is strictly proper, then the expected score is minimized by the honest forecast y_i = f_i. To find the forecast that minimizes E(RPS_L), we take the partial derivative with respect to the cumulative forecast probability Y_j (treating F_j as constant), which gives
∂E(RPS_L)/∂Y_j = L [ (1 − F_j) Y_j^{L−1} − F_j (1 − Y_j)^{L−1} ].   (8)
For L = 1 Eq. (8) reduces to
∂E(RPS_{L=1})/∂Y_j = 1 − 2F_j.   (9)
The partial derivative never vanishes (except in the degenerate case F_j = 1/2), so the expected RPSL=1 is not minimized by the honest forecast; it is minimized at the boundary values instead. Thus the RPSL=1, and hence the RPSSL=1, are not strictly proper. For L = 2 we get
∂E(RPS_{L=2})/∂Y_j = 2 (Y_j − F_j),   (10)
which is zero for Y_j = F_j, and hence the RPSL=2 is strictly proper. For two classes, the RPSL=2 is a special case of the Brier score and has been shown to be strictly proper (e.g., see Mason 2004).
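The hedging incentive can be made concrete numerically. In this sketch (our own code and example values), the assumed true cumulative event probability is F = 0.7; minimizing the expected score of Eq. (7) over candidate forecasts shows that the L = 1 score is optimized by overstating the probability as 1.0, whereas the L = 2 score is optimized by the honest value 0.7:

```python
import numpy as np

def expected_rps(Y, F, L):
    # Eq. (7) for two classes: the event (true cumulative probability F)
    # occurs with score |Y - 1|^L; otherwise the score is |Y - 0|^L
    return F * (1 - Y) ** L + (1 - F) * Y ** L

F = 0.7                          # assumed "true" cumulative probability
Y = np.linspace(0.0, 1.0, 101)   # candidate forecast probabilities
best_L1 = Y[np.argmin(expected_rps(Y, F, 1))]
best_L2 = Y[np.argmin(expected_rps(Y, F, 2))]
print(best_L1, best_L2)  # L=1 rewards hedging to 1.0; L=2 is honest at 0.7
```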
To base the RPSS on a strictly proper score, the negative bias needs to be removed without changing the norm of the score. A straightforward way is to introduce a discretization and squaring error in the reference forecast artificially. To do so, we calculate the score of the reference forecast in Eq. (2) with random resampling of the climatology, where the resample size is equal to the ensemble size. In this case the debiased RPSSL=2 becomes
RPSS_D = 1 − RPS_{L=2,FC} / [ (1/q) Σ_{s=1}^{q} RPS_{L=2,CL,s} ],   (11)
where q is the number of discrete resamples of the reference forecast. To ensure that the climatology is fully represented, q must be chosen large enough. Since the reference forecast now consists of an ensemble of the same size as the forecast ensemble, the possible probabilities and the relative frequencies of occurrence take on the same discrete values as for the forecast system. Alternatively, Eq. (11) can be derived using binomial errors for the reference forecast, which would reduce computation and eliminate the sampling variability resulting from the resampling strategy. This newly defined skill score (RPSSD) is zero for any ensemble size and number of classes (Figs. 1c and 2c). Furthermore, the RPSSD is based on strictly proper scores. Therefore, the RPSSD provides an adequate strategy to compare ensemble systems with small ensemble sizes.
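The resampling version of Eq. (11) can be sketched as follows (our own code and naming; q is kept small here for speed, and the white-noise check mirrors the synthetic experiment of section 3):

```python
import numpy as np

rng = np.random.default_rng(1)
edges = [-0.4307, 0.4307]     # Gaussian tercile edges

def rps(ens, obs, L=2):
    bins = [-np.inf] + edges + [np.inf]
    Y = np.cumsum(np.histogram(ens, bins=bins)[0] / len(ens))
    O = np.cumsum(np.histogram([obs], bins=bins)[0])
    return np.sum(np.abs(Y - O) ** L)

def rpss_d(fcsts, obs, clim, q=200):
    """Eq. (11): the reference score is averaged over q random m-member
    resamples of the climatology, so that the reference suffers the same
    discretization and squaring error as the forecast ensemble."""
    m = fcsts.shape[1]
    rps_fc = sum(rps(f, o) for f, o in zip(fcsts, obs))
    rps_cl = sum(rps(rng.choice(clim, m), o)
                 for o in obs for _ in range(q)) / q
    return 1.0 - rps_fc / rps_cl

# white-noise check: a no-skill 5-member system now scores near zero
fcsts = rng.standard_normal((60, 5))
obs = rng.standard_normal(60)
score = rpss_d(fcsts, obs, clim=rng.standard_normal(10000))
print(score)   # close to zero, within sampling noise
```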

4. Application to seasonal forecasts

In order to assess the benefit of the debiased RPSSD in a real application, the skill scores of the ECMWF Seasonal Forecast System 2 are calculated. This system is an operational, fully coupled atmosphere–ocean GCM and is described in detail by Anderson et al. (2003). The hindcast data analyzed here consist of 1-month lead forecasts (months 2 to 4) of the winter mean [December–February (DJF)] 2-m temperature (T2). A set of 40 ensemble members is available for each year of the hindcast period 1987–2001. For further analysis the forecasts are postprocessed by removing a lead-time-dependent mean model drift based on the 15 years of hindcast climatology. The reference forecast is based on a Gaussian fit to the 40-yr ECMWF Re-Analysis (ERA-40) data for the same period 1987–2001. This dataset is also used to define the edges of the three probability classes. For the forecast and reference forecast, no cross-validation is applied.

In Fig. 5a the gridpoint-based RPSSL=2 is shown for the full set of 40 ensemble members for T2. The overall picture of the RPSSL=2 is dominated by alternating patterns of strong positive and negative skill scores. Negative skill scores are found in large areas over Europe, and positive skill scores over the northern Atlantic Ocean, in the northern part of Scandinavia, and along the east coast of North America. In Fig. 5b the skill scores are shown for the RPSSL=1. The figure shows the same regions of positive skill score over the Atlantic and northern Scandinavia as the RPSSL=2. However, over the continents the regions of localized negative skill score disappear. The RPSSD (Fig. 5c) shows results similar to the RPSSL=2, reflecting that the RPSSD converges to the RPSSL=2 for large ensemble sizes.

The sensitivity to small ensemble sizes is shown in Fig. 6. Here 30 sets of forecasts are averaged, with each set consisting of five members randomly resampled from the 40 ensemble members. Because of the smaller ensemble size, the confidence intervals are wider than for 40 ensemble members. The resulting mean of the subset RPSSL=2 (Fig. 6a) shows distinct regions of positive skill scores, but strong negative skill scores cover much of the remaining area. Although these negative areas are not statistically significant, their magnitudes are comparable to the significant positive values. The RPSSL=1 (Fig. 6b) does not show these regional negative areas; the overall picture exhibits mainly significant positive areas in the Atlantic. The values of the RPSSL=1 based on 5 members are of the same order of magnitude as the values based on 40 members (Fig. 5b), but the confidence intervals are wider. The RPSSD for this set of ensemble members is illustrated in Fig. 6c. The RPSSD is generally higher than the RPSSL=2 (Fig. 6a): the European and American continents, for which strong negative skill scores are found for the RPSSL=2, are now mostly covered by skill scores in the range of ±0.10. These differences are a manifestation of the strong negative bias (see section 3). However, single localized areas with strong negative skill scores still remain.

5. Signal-to-noise detection problem

The examined strategies to calculate the ranked probability skill score allow one to readdress the question of how many ensemble members are required for a forecast system to detect a climate anomaly of a known signal-to-noise ratio. In the white noise climate system the signal-to-noise ratio is described by the mean shift in standardized units of a hypothetical distribution of climate anomalies to the climatological distribution. The procedure corresponds to the one introduced by Kumar et al. (2001).

Figures 7a,c show the skill scores (RPSSL=2 and RPSSD) as a function of the signal-to-noise ratio. Skill scores for ensemble systems with 2, 5, 40, and 100 ensemble members are shown. For large climate mean shifts, the skill scores are independent of the ensemble size. A clear difference is found for forecasts with small ensemble sizes and small climate shifts. As expected from the discussion above, the RPSSL=2 has a negative bias for small ensemble sizes. The RPSSD, in contrast, shows no negative skill for weak anomalies and small ensemble sizes; it is positive irrespective of the ensemble size. The same holds for the RPSSL=1 (Fig. 7b).

An ensemble forecast system with a given ensemble size and a given number of forecasts still has a signal-to-noise detection limit. In our two strategies this detection limit can be described by the confidence intervals discussed above. In Figs. 7b,c the vertical dashed lines indicate the 95% confidence interval of the RPSSL=1 and the RPSSD for ensemble sizes 5 and 40 and 15 forecasts. For this hypothetical setup, comparable to the ECMWF hindcast system, a statistically significant skill score can only be obtained for a signal-to-noise ratio larger than ∼0.3 (40 members) and ∼0.6 (5 members), respectively. Conversely, these limits determine the minimum ensemble size required to achieve a statistically significant skill score for a given signal-to-noise ratio (Kumar et al. 2001). For a higher number of forecasts, the confidence intervals are closer to zero, and hence a statistically significant skill score can be obtained for a smaller signal-to-noise ratio.
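The signal-to-noise experiment can be sketched as follows (our own code, with our own sample sizes): ensemble members and the verifying observation are drawn from a Gaussian shifted by the prescribed signal, and the unbiased RPSS_{L=1} is evaluated against the unshifted climatology. Skill increases monotonically with the shift:

```python
import numpy as np

rng = np.random.default_rng(2)
edges = [-0.4307, 0.4307]
CLIM = np.array([1/3, 2/3, 1.0])   # climatological cumulative forecast

def cum(vals):
    counts = np.histogram(vals, bins=[-np.inf] + edges + [np.inf])[0]
    return np.cumsum(counts / len(vals))

def rpss_l1(shift, m=5, n=500):
    # both the ensemble members and the observation share a mean shift
    # of `shift` standardized units relative to the climatology
    fc = cl = 0.0
    for _ in range(n):
        O = cum(shift + rng.standard_normal(1))
        fc += np.sum(np.abs(cum(shift + rng.standard_normal(m)) - O))
        cl += np.sum(np.abs(CLIM - O))
    return 1.0 - fc / cl

r0, r05, r10 = rpss_l1(0.0), rpss_l1(0.5), rpss_l1(1.0)
print(r0, r05, r10)   # skill rises with the signal-to-noise ratio
```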

To further investigate the origin of the negative bias, a decomposition of the RPSSL=2 and RPSSD is undertaken. For two classes the RPS is the Brier score. Thus the RPS can be decomposed into a reliability term, a resolution term, and an uncertainty term (for details see Hersbach 2000; Wilks 1995) such as
RPS = (1/N) Σ_{k=1}^{K} N_k (Y_k − o_k)^2 − (1/N) Σ_{k=1}^{K} N_k (o_k − o)^2 + o(1 − o),   (12)
where o is the relative frequency of the observations, o_k is the subsample relative frequency, and N_k is the number of times each forecast probability occurs in the collection of forecasts being verified. The reliability score (first term) is a function of the squared difference between the forecast probability and the observed frequency in the different probability categories, while the resolution score (second term) is the average squared difference between the observed frequency in each probability category and the mean frequency observed in the whole sample.
The RPS of a climatological forecast is equal to the uncertainty term, since its resolution and reliability terms are both zero. Hence, using a climatological forecast as the reference, the RPSS can be written as
RPSS = 1 − (REL − RES + UNC) / (UNC + A).   (13)
Here A describes the discretization and squaring error that is introduced into the reference forecast in Eq. (11). For the RPSSL=2, A is equal to zero. Splitting up Eq. (13), the reliability skill score (RELSS) can be written as
RELSS = 1 − REL / (UNC + A),   (14)
and the resolution skill score (RESSS) and uncertainty skill score (UNCSS) as
RESSS = RES / (UNC + A),   (15)

UNCSS = UNC / (UNC + A),   (16)

so that RPSS = RELSS + RESSS − UNCSS.
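For two classes, the decomposition in Eq. (12) is the classical Murphy decomposition of the Brier score. The identity can be verified numerically with a small sketch (our own code and example data):

```python
import numpy as np

def brier_decomposition(y, o):
    """Murphy decomposition of the two-class RPS (Brier score), Eq. (12):
    BS = reliability - resolution + uncertainty.
    y : forecast probabilities (a small set of discrete values)
    o : binary outcomes (0 or 1)"""
    y, o = np.asarray(y, float), np.asarray(o, float)
    obar, n = o.mean(), len(o)
    rel = res = 0.0
    for yk in np.unique(y):
        sel = o[y == yk]
        nk, ok = len(sel), sel.mean()   # N_k and subsample frequency o_k
        rel += nk * (yk - ok) ** 2
        res += nk * (ok - obar) ** 2
    return rel / n, res / n, obar * (1 - obar)

y = [0.0, 0.2, 0.2, 0.8, 0.8, 1.0]
o = [0,   0,   1,   1,   1,   1  ]
rel, res, unc = brier_decomposition(y, o)
bs = np.mean((np.asarray(y) - np.asarray(o)) ** 2)
print(rel - res + unc, bs)   # the two agree
```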
In Figs. 8a,d the RELSSL=2 and RELSSD, respectively, are plotted as a function of the signal-to-noise ratio. The RELSSL=2 shows a strong dependence on the ensemble size. For a weak signal-to-noise ratio, the scores range from zero reliability to perfect reliability for 2-member and 100-member ensemble systems, respectively. The RELSSD is still dependent on the ensemble size for weak anomalies but reveals substantially higher values.

In Figs. 8b,e the results for the RESSSL=2 and RESSSD are shown. Whereas the RESSSL=2 proves to be independent of the ensemble sizes, the RESSSD is reduced for small ensemble sizes and strong anomalies. In Figs. 8c,f the UNCSSL=2 and UNCSSD are shown for varying ensemble sizes. For A = 0, which is the case for the RPSSL=2, the UNCSSL=2 is one for any ensemble size. The UNCSSD, however, shows dependence on the ensemble size. For a five-member ensemble system the UNCSSD is reduced to a value of about 0.85. For the RPSSD the bias in the RELSS is compensated by introducing ensemble size dependence in the resolution and uncertainty term.

6. Conclusions

In this study, the mechanics of the RPSSL are studied in the context of forecast systems with small ensemble sizes and different norms L. In agreement with earlier studies, it is shown that the standard calculation of the RPSSL=2 leads to a negative bias that can be even larger than the expected skill of the forecast system itself. This negative bias results from sampling errors in the forecast probabilities. It is a consequence of the squared measure used to quantify the forecast error in the cumulative probability space. It is particularly large for small ensemble sizes. For higher orders of L the bias is further increased.

Two strategies are introduced that address the bias problem of the RPSSL=2. The first is a modified version of the RPSL=2 that is based on the absolute difference (RPSL=1) instead of the squared difference of the cumulative probabilities. This score is comparable to the mean absolute error (MAE). The expected skill score is independent of the ensemble size, whereas the confidence intervals depend on the ensemble size and the number of forecasts. However, the RPSSL=1 is not strictly proper, which means forecasters can hedge their forecast probabilities toward values that are likely to score higher. The second score considered is the RPSSD, which represents a debiased version of the standard RPSSL=2. The proposed modification involves resampling the climatology to construct the reference forecast. It is shown that this method is based on strictly proper scores and gives reasonable results even for systems with small ensemble sizes. A pure random climate forecast is used to show that the RPSSL=1 and RPSSD provide unbiased estimates of the skill score even for small ensemble sizes. For large ensemble sizes both skill measures are comparable. The random noise forecasts also form the basis for determining the confidence intervals.

To test the newly proposed scores, two examples are considered. First the operational ECMWF Seasonal Forecast System 2 is used to quantify the skill of near-surface winter mean temperature forecasts with the RPSSL=1, RPSSL=2, and RPSSD. It is shown that the new skill scores yield increased values for the forecast system, in particular for small ensemble sizes. Furthermore, the gridpoint-based skill score structure is much more homogeneous, and the occurrence of scattered negative values, as is the case of the RPSSL=2, is largely suppressed.

Second, the RPSSD and the RPSSL=1 are used to find the minimum ensemble size required to predict a given climate signal. Here a white noise climate system is used in which the signal-to-noise ratio is described by the mean shift of a hypothetical distribution of climate anomalies to the climatological distribution in standardized units. By using a hypothetical setup comparable to the ECMWF hindcast system, statistically significant skill scores can be anticipated for climate signal-to-noise ratios larger than ∼0.3 (40 members) and ∼0.6 (5 members), respectively. With this methodology, the confidence intervals can be attached to the results from Kumar et al. (2001).

In the context of the signal-to-noise detection problem, a decomposition of the quadratic norms identifies the bias of the RPSSL=2 as a reliability problem, whereas the resolution is unaffected by the ensemble size. A decomposition of the resampling strategy of the RPSSD shows an improvement in the reliability for small ensemble sizes. However, this is at the expense of the resolution skill score, which is reduced. This seems more logical, as in this framework the resolution should also be affected by the ensemble size.

It is argued that for ensemble systems with small ensemble sizes, a debiased version of the RPSSL=2 should be used to quantify the probabilistic skill of the system, either the RPSSL=1 or the RPSSD. Since the RPSSL=1 proves not to be strictly proper, we suggest the RPSSD as the preferable choice. However, other formulations could be considered, for example, comparing the forecasts against a random guess, as recently shown in the parallel study of Mason (2004).

Finally, the use of these new skill scores is not restricted to seasonal forecast systems; they are likely to be beneficial for other applications such as climate prediction or short-range limited-area ensemble prediction systems (Marsigli et al. 2001). Debiased skill scores are also desirable for the comparison of multimodel ensemble systems with different ensemble sizes (Müller et al. 2004).

Acknowledgments

This study was supported by the Swiss NSF through Grant 2100-061631.00 and the National Centre for Competence in Research Climate (NCCR-Climate). F. J. Doblas-Reyes received funding support through the DEMETER (EVK2-1999-00024) project. Thanks are expressed to Ch. Schär, I. T. Jolliffe, D. B. Stephenson, and the anonymous reviewers for their constructive comments and criticisms.

REFERENCES

  • Anderson, D. L., and Coauthors, 2003: Comparison of the ECMWF seasonal forecast systems 1 and 2, including the relative performance for the 1997/8 El Niño. ECMWF Tech. Memo. 404, 93 pp.

  • Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126, 2503–2518.

  • Buizza, R., T. Petroliagis, T. N. Palmer, J. Barkmeijer, M. Hamrud, A. Hollingsworth, A. Simmons, and N. Wedi, 1998: Impact of model resolution and ensemble size on the performance of an ensemble prediction system. Quart. J. Roy. Meteor. Soc., 124, 1935–1960.

  • Déqué, M., 1997: Ensemble size for numerical seasonal forecasts. Tellus, 49A, 74–86.

  • Epstein, E. S., 1969: A scoring system for probability forecasts of ranked categories. J. Appl. Meteor., 8, 985–987.

  • Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570.

  • Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley, 240 pp.

  • Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. J. Climate, 14, 1671–1676.

  • Marsigli, C., A. Montani, F. Nerozzi, T. Paccagnella, S. Tibaldi, F. Molteni, and R. Buizza, 2001: A strategy for high-resolution ensemble prediction. II: Limited-area experiments in four Alpine flood events. Quart. J. Roy. Meteor. Soc., 127, 2095–2115.

  • Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.

  • Mason, S. J., 2004: On using climatology as a reference strategy in the Brier and ranked probability skill scores. Mon. Wea. Rev., 132, 1891–1895.

  • Müller, W. A., C. Appenzeller, and C. Schär, 2004: Probabilistic seasonal prediction of the winter North Atlantic Oscillation and its impact on near-surface temperature. Climate Dyn., doi:10.1007/s00382-004-0492-z.

  • Murphy, A. H., 1969: On the ranked probability skill score. J. Appl. Meteor., 8, 988–989.

  • Murphy, A. H., 1971: A note on the ranked probability skill score. J. Appl. Meteor., 10, 155–156.

  • Nicholls, N., 2001: The insignificance of significance testing. Bull. Amer. Meteor. Soc., 81, 981–986.

  • Palmer, T. N., C. Brankovic, and D. S. Richardson, 2000: A probability and decision-model analysis of PROVOST seasonal multi-model ensemble integrations. Quart. J. Roy. Meteor. Soc., 126, 2013–2033.

  • Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182, 990–1000.

  • Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.

  • Tracton, M. S., and E. Kalnay, 1993: Operational ensemble prediction at the National Meteorological Center: Practical aspects. Wea. Forecasting, 8, 379–398.

  • Unger, D. A., 1985: A method to estimate the continuous ranked probability score. Preprints, Ninth Conf. on Probability and Statistics in Atmospheric Science, Virginia Beach, VA, Amer. Meteor. Soc., 206–213.

  • Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Vol. 59, International Geophysics Series, Academic Press, 467 pp.

Fig. 1.

The (a) RPSSL=2, (b) RPSSL=1, and (c) RPSSD for white noise climate forecasts as a function of the ensemble size. The thin lines denote upper and lower 95% confidence intervals for a 15-yr sample. Thick lines show the mean values. Three equiprobable classes are used.

Citation: Journal of Climate 18, 10; 10.1175/JCLI3361.1

Fig. 2.

As in Fig. 1 but for the upper and lower 95% confidence intervals as a function of the number of equiprobable classes. Black (gray) lines illustrate the set of 40 (5) ensemble members.


Fig. 3.

The RPSSL=2 for 5 (gray) and 40 (black) ensemble members as a function of the number of forecasts. The thin lines denote upper and lower 95% confidence intervals for a 15-yr sample. Thick lines show the mean values.


Fig. 4.

The RPSSL as a function of the ensemble size and different norms L for white noise forecasts of two classes.


Fig. 5.

Gridpoint-based (a) RPSSL=2, (b) RPSSL=1, and (c) RPSSD for winter mean (DJF) near-surface temperature forecasts based on the ECMWF Seasonal Forecast System 2. The forecasts are based on November initialization and cover the winters 1987/88–2001/02. An ensemble system with 40 members is used. The ±95% confidence intervals are denoted as thick plain (dotted) contours.


Fig. 6.

As in Fig. 5 but for an average of 30 simulations based on a random subset of five ensemble members.


Fig. 7.

The (a) RPSSL=2, (b) RPSSL=1, and (c) RPSSD as a function of a standardized anomaly (signal-to-noise ratio; see text for details). Numbers denote ensemble sizes 2, 5, 40, and 100. The scores are based on three equiprobable classes. The dashed lines in (b) and (c) indicate the one-tailed 95% confidence limits and the corresponding standardized anomaly for 40 and 5 members, respectively.


Fig. 8.

The (a) RELSSL=2, (b) RESSSL=2, (c) UNCSSL=2, (d) RELSSD, (e) RESSSD, and (f) UNCSSD as a function of a standardized anomaly. Numbers denote ensemble sizes 2, 5, 40, and 100.



