## 1. Introduction

The ranked probability score (RPS) is the sum of the squared differences between cumulative forecast probabilities and cumulative observed probabilities, and measures both forecast reliability and resolution (Murphy 1973). The ranked probability skill score (RPSS) compares the RPS of a forecast with some reference forecast such as “climatology” (using past mean climatic values as the forecast), oriented so that RPSS < 0 (RPSS > 0) corresponds to a forecast that is less (more) skillful than climatology.

Categorical forecast probabilities are often estimated from ensembles of numerical model integrations by counting the number of ensemble members in each category. Finite ensemble size introduces sampling error into such probability estimates, and the RPSS of a reliable forecast model with finite ensemble size is an increasing function of ensemble size (Kumar et al. 2001; Tippett et al. 2007). A similar relation exists between correlation and ensemble size (Sardeshmukh et al. 2000). The dependence of RPSS on ensemble size makes it challenging to use RPSS to compare forecast models with different ensemble sizes. For instance, it may be difficult to know whether a forecast system has higher RPSS because it is based on a superior forecast model or because it uses a larger ensemble. This question often arises in the comparison of multimodel and single model forecasts (Hagedorn et al. 2005; Tippett and Barnston 2008). The dependence of RPSS on ensemble size is not a problem when comparing forecast quality. Improved RPSS is associated with improved forecast quality and is desirable whether it results from larger ensemble size or from a better forecast model.

Müller et al. (2005) recently introduced a resampling strategy to estimate the infinite-ensemble RPSS from the finite-ensemble RPSS and called this estimate the “debiased RPSS.” Weigel et al. (2007) derived an analytical formula for the debiased RPSS and proved that it is an unbiased estimate of the infinite-ensemble RPSS in the case of uncorrelated ensemble members, that is, forecasts without skill. Here it is proved that the debiased RPSS is an unbiased estimate of the infinite-ensemble RPSS for any reliable forecasts. It is shown that over- or underconfident forecasts introduce a dependence of the debiased RPSS on ensemble size. Simplification of the results of Weigel et al. (2007) shows that the debiased RPSS is a multicategory generalization of the result of Richardson (2001) for the Brier skill score.

## 2. RPSS and debiased RPSS

The RPS of a $K$-category probability forecast is

$$
\mathrm{RPS} = \sum_{i=1}^{K} \left( \sum_{j=1}^{i} P_j - \sum_{j=1}^{i} O_j \right)^{2},
\tag{1}
$$

where $P_i$ is the forecast probability assigned to the $i$th category and $O_i$ is 1 when the observation falls into the $i$th category and 0 otherwise. When forecast probabilities are computed by counting the number of ensemble members in each category, finite ensemble size results in sampling errors that increase RPS.
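In code, the RPS amounts to two cumulative sums. The following minimal Python sketch (using NumPy; the probabilities and category index are illustrative values, not taken from the text) makes the definition concrete.

```python
import numpy as np

def rps(forecast_probs, obs_category):
    """RPS: sum of squared differences between cumulative forecast
    and cumulative observed probabilities."""
    P = np.cumsum(forecast_probs)                             # cumulative forecast probabilities
    O = np.cumsum(np.eye(len(forecast_probs))[obs_category])  # cumulative observation indicator
    return float(np.sum((P - O) ** 2))

# Illustrative tercile forecast with the observation in the first category:
print(rps([0.5, 0.3, 0.2], 0))   # ≈ 0.29
```

A perfectly sharp, correct forecast (probability 1 on the observed category) gives RPS = 0, the best possible score.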

Kumar et al. (2001) quantified the effect of finite ensemble size $M$ on the expected RPS in a reliable forecast system. Tippett et al. (2007) generalized that result to tercile categories and later (Tippett and Barnston 2008) to an arbitrary number of categories as

$$
\langle \mathrm{RPS}(M) \rangle = \left( 1 + \frac{1}{M} \right) \langle \mathrm{RPS}(\infty) \rangle ,
\tag{2}
$$

where $\mathrm{RPS}(M)$ is the RPS of an $M$-member ensemble forecast and angle brackets denote averaging over forecasts. The RPSS of an $M$-member ensemble is

$$
\mathrm{RPSS}(M) = 1 - \frac{\langle \mathrm{RPS}(M) \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle} ,
\tag{3}
$$

where $\mathrm{RPS}_{\mathrm{Cl}}$ is the RPS of a reference forecast consisting of climatological probabilities. Sampling error causes RPSS to decrease. Using Eq. (2), the infinite-ensemble RPSS can be expressed in terms of the finite-ensemble RPSS as

$$
\mathrm{RPSS}(\infty) = 1 - \frac{M}{M+1} \, \frac{\langle \mathrm{RPS}(M) \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle} .
\tag{4}
$$
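The finite-ensemble inflation factor $(1 + 1/M)$ can be checked by Monte Carlo simulation. The sketch below assumes a reliable Gaussian signal-plus-noise system of the kind used later in section 3, with forecast and observation distributions identical given the signal; the values of $r$, $M$, the sample count, and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
r, M, N = 0.6, 8, 40000          # illustrative skill, ensemble size, trial count
lo, hi = -0.4307, 0.4307         # tercile boundaries of the N(0, 1) climatology

def ncdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def rps(P, k):
    """RPS for forecast probabilities P and observation in category k."""
    O = np.zeros(len(P)); O[k] = 1.0
    return np.sum((np.cumsum(P) - np.cumsum(O)) ** 2)

sum_rps_M = sum_rps_inf = 0.0
for _ in range(N):
    s = rng.standard_normal()                                  # predictable signal
    obs = r * s + sqrt(1 - r**2) * rng.standard_normal()       # observation
    k = int(obs > lo) + int(obs > hi)                          # observed category
    x = r * s + sqrt(1 - r**2) * rng.standard_normal(M)        # reliable ensemble
    P_M = np.array([(x <= lo).mean(),
                    ((x > lo) & (x <= hi)).mean(),
                    (x > hi).mean()])                          # counted probabilities
    a, b = (lo - r*s) / sqrt(1 - r**2), (hi - r*s) / sqrt(1 - r**2)
    P_inf = np.array([ncdf(a), ncdf(b) - ncdf(a), 1.0 - ncdf(b)])  # exact probabilities
    sum_rps_M += rps(P_M, k)
    sum_rps_inf += rps(P_inf, k)

# The two printed values should nearly agree for a reliable system.
print(sum_rps_M / N, (1 + 1/M) * sum_rps_inf / N)
```

The counted probabilities inherit binomial sampling noise of order $1/M$, which is exactly the source of the inflation factor.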

The strategy of Müller et al. (2005) to remove the dependence of RPSS on ensemble size $M$ was to artificially increase the error in the reference forecast by computing climatological probabilities using the same number of samples as ensemble members and then to define a debiased RPSS, denoted $\mathrm{RPSS}_D$, by

$$
\mathrm{RPSS}_D = 1 - \frac{\langle \mathrm{RPS}(M) \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}}(M) \rangle} ,
\tag{5}
$$

where $\mathrm{RPS}_{\mathrm{Cl}}(M)$ is the RPS of a climatological reference forecast formed from $M$ samples. Müller et al. (2005) found empirically that $\mathrm{RPSS}_D$ had little if any dependence on ensemble size. In fact, the expected value of $\mathrm{RPSS}_D$ is the same as $\mathrm{RPSS}(\infty)$, and $\mathrm{RPSS}_D$ is indeed an unbiased estimate of the infinite-ensemble RPSS for all reliable forecasts, since the climatological forecast is itself reliable and Eq. (2) applies to it as well:

$$
\langle \mathrm{RPS}_{\mathrm{Cl}}(M) \rangle = \left( 1 + \frac{1}{M} \right) \langle \mathrm{RPS}_{\mathrm{Cl}} \rangle .
\tag{6}
$$

Substituting Eqs. (2) and (6) into Eq. (5) recovers Eq. (4). In Müller et al. (2005), $\langle \mathrm{RPS}_{\mathrm{Cl}}(M) \rangle$ was computed by repeatedly sampling from the historical record. Weigel et al. (2007) computed $\langle \mathrm{RPS}_{\mathrm{Cl}}(M) \rangle$ analytically using properties of the multinomial distribution and expressed $\mathrm{RPSS}_D$ as

$$
\mathrm{RPSS}_D = 1 - \frac{\langle \mathrm{RPS}(M) \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle + D} ,
\tag{7}
$$

where the correction $D$ depends on the ensemble size $M$ and on the climatological probability $p_i$ of the $i$th category. In light of Eq. (4), it must be the case that

$$
D = \frac{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle}{M} ,
\tag{8}
$$

since Eqs. (4) and (7) coincide only when $\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle + D = (1 + 1/M) \langle \mathrm{RPS}_{\mathrm{Cl}} \rangle$. This identity becomes apparent when the expression for $D$ is simplified. From Eq. (12) of Weigel et al. (2007),

$$
D = \sum_{i=1}^{K-1} \operatorname{var} \left( \hat{C}_i \right), \qquad \hat{C}_i \equiv \sum_{j=1}^{i} \hat{p}_j ,
\tag{9}
$$

where $\hat{p}_i$ is the $M$-member sample estimate of $p_i$. Since the $M$-member sample estimates $\hat{C}_i$ of the cumulative probabilities are binomially distributed, their means are $C_i$ and their variances are $C_i (1 - C_i)/M$, where the cumulative climatological probability $C_i$ is defined by

$$
C_i \equiv \sum_{j=1}^{i} p_j .
\tag{10}
$$

Therefore $D$ has the simple form

$$
D = \frac{1}{M} \sum_{i=1}^{K-1} C_i (1 - C_i) .
\tag{11}
$$
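The binomial-variance argument behind the simple form of $D$ can be checked by direct simulation. The sketch below uses equiprobable terciles purely as an illustration, for which $\sum_i C_i(1-C_i) = 4/9$ and hence $D = 4/(9M)$; the ensemble size, trial count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([1/3, 1/3, 1/3])     # equiprobable terciles (illustrative climatology)
C = np.cumsum(p)                  # cumulative climatological probabilities
M, N = 10, 200000                 # ensemble size and number of trials

# Draw M-member "climatological ensembles" and form sample cumulative probabilities.
counts = rng.multinomial(M, p, size=N)
C_hat = np.cumsum(counts, axis=1) / M

# Empirical variances of the first K-1 cumulative estimates vs C_i(1 - C_i)/M.
emp_var = C_hat[:, :-1].var(axis=0)
bin_var = C[:-1] * (1 - C[:-1]) / M
D = bin_var.sum()                 # D = (1/M) * sum_i C_i (1 - C_i)

print(emp_var, bin_var, D)
```

The last cumulative probability is always exactly 1 and contributes no variance, which is why the sums run only to $K-1$.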

A further simplification results when $\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle$ is expressed in terms of the climatological categorical probabilities $p_i$. Explicitly, $\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle$ is simply the RPS of the climatological forecast, Eq. (1) with $P_i = p_i$, summed over all possible outcomes of the observations, weighted by the probabilities of each outcome. That is,

$$
\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle = \sum_{j=1}^{K} p_j \sum_{i=1}^{K} \left( C_i - \sum_{k=1}^{i} \delta_{kj} \right)^{2},
\tag{12}
$$

where $\delta_{ij}$ is defined to be 1 when $i = j$ and 0 otherwise. Direct manipulation of this expression gives

$$
\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle = \sum_{i=1}^{K-1} C_i (1 - C_i) ,
\tag{13}
$$

so that Eqs. (8) and (11) are identical, and $\mathrm{RPSS}_D$ is seen to be a multicategory generalization of the Brier skill score result of Richardson (2001).
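The outcome-weighted enumeration of $\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle$ and its closed form $\sum_i C_i(1-C_i)$ can be compared directly for any climatology. The probabilities below are illustrative, deliberately non-equiprobable.

```python
import numpy as np

# Illustrative (non-equiprobable) climatological probabilities; any valid p works.
p = np.array([0.2, 0.5, 0.3])
C = np.cumsum(p)                      # cumulative climatological probabilities
K = len(p)

# <RPS_Cl> by direct enumeration over all K outcomes, weighted by p_k.
rps_cl = sum(p[k] * np.sum((C - np.cumsum(np.eye(K)[k])) ** 2) for k in range(K))

# Closed form: <RPS_Cl> = sum_i C_i (1 - C_i), the last term always vanishing.
closed = np.sum(C[:-1] * (1 - C[:-1]))

M = 10
print(rps_cl, closed, closed / M)     # the last value is the correction D
```

For these probabilities both expressions give 0.37, and the correction is simply $\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle / M$.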

## 3. Unreliable forecasts

The results above do not give any guidance about the dependence of RPSS on ensemble size when the forecasts are unreliable. Ferro et al. (2008) derive a more general estimator for RPSS that is applicable to under- and overconfident ensembles, as long as the ensemble members are "exchangeable." Although Müller et al. (2005) state that $\mathrm{RPSS}_D$ is an unbiased estimate of the infinite-ensemble RPSS and is independent of ensemble size, there was no explicit examination of the behavior of $\mathrm{RPSS}_D$ for unreliable forecasts. That behavior is investigated here in an example in which the forecasts are unreliable.

_{D}A simple univariate example is considered here in which the forecasts and observations are normally distributed. The expected correlation between the ensemble mean and observations is *r*, and the expected correlation between the ensemble mean and an ensemble member is *r _{f}*;

*r*measures potential predictability, that is, the ability of the forecast model to predict itself. Explicitly, the observations are normally distributed with mean

_{f}*rs*and variance 1 −

*r*

^{2}, denoted

*N*(

*rs*, 1 −

*r*

^{2}), and the forecast distribution is

*N*(

*r*

_{f}

*s*, 1 −

*r*

^{2}

_{f}); the distribution of the random variable

*s*is

*N*(0, 1). The forecast is reliable when

*r*=

_{f}*r*and overconfident (underconfident) when

*r*>

_{f}*r*(

*r*<

_{f}*r*).
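This Gaussian setup can be sketched numerically. In the sketch below, the values $r = 0.3$ and $r_f = 0.8$ are hypothetical choices representing a very overconfident system, not necessarily those of Table 1, and the ensemble size, trial count, and seed are likewise illustrative.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
r, rf = 0.3, 0.8          # hypothetical very overconfident system (rf > r)
M, N = 8, 60000           # ensemble size and number of simulated forecasts
lo, hi = -0.4307, 0.4307  # tercile boundaries of the N(0, 1) climatology

def ncdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def rps(P, k):
    """RPS for forecast probabilities P and observation in category k."""
    O = np.zeros(len(P)); O[k] = 1.0
    return np.sum((np.cumsum(P) - np.cumsum(O)) ** 2)

sum_M = sum_inf = 0.0
for _ in range(N):
    s = rng.standard_normal()
    obs = r * s + sqrt(1 - r**2) * rng.standard_normal()      # observation ~ N(rs, 1-r^2)
    k = int(obs > lo) + int(obs > hi)
    x = rf * s + sqrt(1 - rf**2) * rng.standard_normal(M)     # overconfident ensemble
    P_M = np.array([(x <= lo).mean(),
                    ((x > lo) & (x <= hi)).mean(),
                    (x > hi).mean()])
    a, b = (lo - rf*s) / sqrt(1 - rf**2), (hi - rf*s) / sqrt(1 - rf**2)
    P_inf = np.array([ncdf(a), ncdf(b) - ncdf(a), 1.0 - ncdf(b)])
    sum_M += rps(P_M, k)
    sum_inf += rps(P_inf, k)

rps_cl = 4.0 / 9.0                      # <RPS_Cl> = sum_i C_i(1 - C_i) for terciles
rpss_M = 1 - (sum_M / N) / rps_cl
rpss_inf = 1 - (sum_inf / N) / rps_cl
rpss_D = 1 - (sum_M / N) / (rps_cl * (1 + 1.0 / M))
print(rpss_M, rpss_D, rpss_inf)         # for rf > r, rpss_D overshoots rpss_inf here
```

Setting `rf = r` instead reproduces the reliable case, in which the three skill estimates differ only by sampling noise and the debiasing.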

Values of $r$ and $r_f$ were chosen corresponding to reliable, weakly overconfident, very overconfident, weakly underconfident, and very underconfident forecast systems, as indicated in Table 1. The expected values of $\mathrm{RPSS}(M)$ and $\mathrm{RPSS}_D$ for tercile-based categorical forecasts were computed from $10^6$ simulations of the observations and forecast ensembles. Figure 1 shows the results as a function of ensemble size $M$. Figure 1a shows that for reliable forecasts $\mathrm{RPSS}_D$ is, as proved, an unbiased estimate of $\mathrm{RPSS}(\infty)$, independent of ensemble size. Figures 1b and 1c show that for overconfident forecasts $\mathrm{RPSS}_D$ overestimates $\mathrm{RPSS}(\infty)$, with the discrepancy between $\mathrm{RPSS}_D$ and $\mathrm{RPSS}(\infty)$ being greater than that between $\mathrm{RPSS}(M)$ and $\mathrm{RPSS}(\infty)$ for very overconfident forecasts. There is some indication of the tendency of $\mathrm{RPSS}_D$ to overestimate $\mathrm{RPSS}(\infty)$ in Figs. 3a and 3b of Weigel et al. (2007), indicating model overconfidence. In the underconfident examples, $\mathrm{RPSS}_D$ slightly underestimates $\mathrm{RPSS}(\infty)$.

## 4. Summary

The ranked probability skill score measures the reliability and resolution of categorical probability forecasts relative to the climatology forecast (Murphy 1973). When categorical forecast probabilities are estimated from finite ensembles, sampling error negatively impacts RPSS (Kumar et al. 2001; Tippett et al. 2007). Weigel et al. (2007) recently derived an analytical formula for the debiased RPSS, an estimate of the infinite-ensemble RPSS in terms of the finite-ensemble RPSS, based on the resampling strategy of Müller et al. (2005). Here it has been proved that the debiased RPSS is an unbiased estimate of the infinite-ensemble RPSS for reliable forecasts, but only for reliable forecasts: over- or underconfident forecasts introduce a dependence of the debiased RPSS on ensemble size. Analysis of the results of Weigel et al. (2007) shows that the debiased RPSS is a multicategory generalization of the Brier skill score result of Richardson (2001).

## Acknowledgments

The author thanks Simon Mason, Andreas Weigel, and Tony Barnston for their comments and suggestions. The author is supported by a grant/cooperative agreement from the National Oceanic and Atmospheric Administration (NA05OAR4311004). The views expressed herein are those of the author and do not necessarily reflect the views of NOAA or any of its subagencies.

## REFERENCES

Ferro, C. A. T., D. S. Richardson, and A. P. Weigel, 2008: On the effect of ensemble size on the discrete and continuous ranked probability scores. *Meteor. Appl.*, **15**, 19–25.

Hagedorn, R., F. J. Doblas-Reyes, and T. N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting—I. Basic concept. *Tellus*, **57A**, 219–233.

Kumar, A., A. G. Barnston, and M. P. Hoerling, 2001: Seasonal predictions, probabilistic verifications, and ensemble size. *J. Climate*, **14**, 1671–1676.

Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. *J. Climate*, **18**, 1513–1523.

Murphy, A. H., 1973: A new vector partition of the probability score. *J. Appl. Meteor.*, **12**, 595–600.

Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. *Quart. J. Roy. Meteor. Soc.*, **127**, 2473–2489.

Sardeshmukh, P. D., G. P. Compo, and C. Penland, 2000: Changes of probability associated with El Niño. *J. Climate*, **13**, 4268–4286.

Tippett, M. K., and A. G. Barnston, 2008: Skill of multimodel ENSO probability forecasts. *Mon. Wea. Rev.*, in press.

Tippett, M. K., A. G. Barnston, and A. W. Robertson, 2007: Estimation of seasonal precipitation tercile-based categorical probabilities from ensembles. *J. Climate*, **20**, 2210–2228.

Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The discrete Brier and ranked probability skill scores. *Mon. Wea. Rev.*, **135**, 118–124.

Table 1. Values of $r$ and $r_f$ used in the numerical experiments.